Business Analytics Assignment - GROUP 21

Analysts:

  • An Naafi Y. Lathifah [5974658]
  • Irvan Arif [5690838]
  • Jo Lynn Tan [6055443]
  • M. Miftahul Fahmi [5972035]
  • Valdo Pratama [5702313]

Report Outline

The project will be presented in 4 main parts, starting from Project Background (contains the Business Problems), Data Preparation, Modeling, and finally Interpretation (Summary and Actionable Insights).

Part 1 - Project Background

1.1 Abstract

1.2 Business Problems

1.3 Setting Up the R Workspace

1.4 Dataset

1.5 Workflow

1.6 Reflection

Part 2 - Data Preparation

2.1 Data Exploration

2.2 Data Cleaning

2.3 Feature Engineering

2.4 Data Transformation

2.5 Initial Plotting (Result after Transformation)

2.6 Reflection

Part 3 - Data Modeling

3.1 Modeling Preparation

3.2 Classification Method

3.3 Logistic Regression Method

Part 4 - Interpretation

4.1 Result

4.2 Recommendations

4.3 Further Improvement Recommendation

4.4 Conclusion & Limitation


Part 1. Project Background

This part covers the business problem, the objective of the analysis, the dataset, and the translation of the problem into the machine-learning tasks we will use.

1.1. Abstract

In the rapidly expanding landscape of e-commerce, strategic marketing decisions play an important role (Reyes-Gomez et al., 2024 [1]), particularly for product sellers and wholesale vendors looking to penetrate new markets. As Canada’s e-commerce market was projected to soar to $140.5 billion by 2025 (International Trade Administration, 2023 [2]), predicting the patterns behind how a product wins the Canadian e-commerce market offers great value to vendors. This research seeks to provide realistic and actionable recommendations to e-commerce vendors and third-party sellers who want to promote their products on Amazon Canada. By analysing the data and drawing insights from market trends, consumer behaviour, pricing strategy, and product descriptions, we provide these results for vendors and sellers.

On Amazon, there are various types of sellers and vendors who operate on the platform:

  1. Professional Sellers:
    • These are independent individuals or businesses who sell products on Amazon’s marketplace.
    • They can list products in over 20 categories, set their own prices, and manage their inventory.
    • Professional sellers pay a monthly subscription fee, along with additional fees for each item sold.
    • They have access to tools and services provided by Amazon to manage their selling operations efficiently.
  2. Third-Party Sellers:
    • Third-party sellers are a subset of professional sellers who sell products they don’t own or manufacture.
    • They may source products from wholesalers, distributors, or manufacturers and sell them on Amazon’s platform.
    • These sellers often leverage Amazon’s Fulfillment by Amazon (FBA) service, where Amazon handles storage, packaging, and shipping of their products.
    • Third-party sellers can choose to fulfill orders themselves (Merchant Fulfilled) or use Amazon’s fulfillment services.
  3. Vendors:
    • Vendors are typically manufacturers, distributors, or large brands who sell products directly to Amazon.
    • They operate under a wholesale model, supplying products in bulk to Amazon’s fulfillment centers.
    • Amazon purchases inventory directly from vendors and manages pricing, marketing, and customer service for the products.
    • Amazon’s Vendor Management team is responsible for identifying brands, products and pricing strategy to increase sales for each product department and category.
    • Vendors may have access to additional marketing and promotional opportunities on Amazon’s platform.

In summary, professional sellers are independent entities who sell products on Amazon, third-party sellers may sell products they don’t own, and vendors supply products directly to Amazon for resale. Each type of seller has its own advantages, business models, and responsibilities within the Amazon ecosystem.

The analysis conducted in this project aims to provide actionable insights into Product Marketing Strategy to be successful on Amazon Canada, tailored for the Category Management Team and different types of sellers that sell products on the platform.

Findings

When the client is a new vendor that wants to achieve “Best Seller” status, we recommend focusing on the four departments with the highest percentage of best-seller products: Sports & Outdoors, Automotive, Clothing, Shoes & Jewelry, and Electronics.

The chosen models suggest key strategies for optimizing sales and reviews in the four recommended departments. In Automotive, the emphasis is on increasing sales volume. Clothing, Shoes & Jewelry prioritizes review generation, for instance through customer engagement and QR codes. Similarly, Electronics targets review accumulation using comparable tactics. In Fashion, product quality and reviews are highlighted as pivotal factors influencing purchases. Finally, Sports & Outdoors focuses on brand recognition, which steers consumers towards higher-priced items.

1.2. Business Problem

In this assignment, Group 21 acts as business consultants at a Canadian e-commerce consulting firm that helps clients design, conceptualize, and implement product launch campaigns on the Canadian e-commerce market. To expand our client base, the firm has decided to add the Amazon Canada platform to its e-commerce portfolio. We want to attract and serve vendors that are new to Amazon Canada as well as existing vendors with poor sales performance, positioning our ability to help clients reach Best-Seller status in a short amount of time as our primary service offering. We aim to derive insights from top-performing products to help sellers with low sales performance achieve best-seller ranking on Amazon Canada, and to identify the product categories in which a new product is most likely to be ranked as a Best Seller quickly. Our objective is therefore to leverage data analytics to understand the existing Amazon Canada product landscape, which will let us formulate realistic and actionable recommendations that help new clients reach their business objectives.

Thus, the main business problem to be answered here is: in the Amazon Canada e-commerce market, which products are recommended to sell as best sellers, and how can a seller become a best seller in these departments?

Why Canada?

  • Canada is one of the top 10 countries with the highest number of e-commerce users [3]. Canadian consumers increasingly rely on the Internet to place orders; for the past decade, Internet consumer sales have risen at a far higher rate than traditional retail sales [2]. E-commerce penetration in Canada is also relatively high, reaching 75% of the total population.
  • Data availability: among the Amazon datasets available on Kaggle for this platform, the Canada dataset is one of the most comprehensive (about 2.1M rows), compared with the UK and USA datasets. [4]

Why Best Seller?

The best-seller list is one of the most well-known forms of marketing. Its history can be traced back more than 20 years, when best-seller lists (for books) were already powerful marketing tools (Miller, [5]). A study by Farzad Fathi (2023, [6]) on the effect of best-seller labels on e-commerce platforms found that such a label “secures the seller a spot in consumers’ consideration set and in turn increases consumers’ purchase likelihood”. Thus, for new sellers, aiming for best-seller status is a good way to build product branding.

1.3. Setting Up the R Workspace

Mainly, the following packages are used here, divided into four groups: Visualization, Data Cleaning, Decision Tree Analysis (Classification), and Logistic Regression.

Visualization

  • ggplot2: This package will be used for visualizing the data and analysis results. It provides a consistent interface for creating a wide variety of plots
  • gridExtra: This package is used to arrange multiple grid-based plots side by side, making them more convenient to read
  • corrplot: This package enables the visualization of correlation matrices (heatmaps, scatterplots, and other graphical representations).

Data Cleaning

  • stringr: This package will be used for several tasks such as pattern matching and data cleaning
  • lattice: This package will be used for visualizing relationships in multivariate data
  • dplyr: We use this for filtering rows, selecting specific columns, rearranging rows, and summarizing data
  • fuzzyjoin: We use this for matching strings based on similarity, useful for merging datasets
  • httr: This package will be used to send requests to websites to get data, we use this so the lecturers do not need to download the additional data (so the main data is still one single file)
  • magick: This package will be used for creating an image visualization in R.

Decision Tree Analysis (Classification)

  • C50: This package is utilized for building decision trees
  • rsample: With this package, we can perform resampling methods like cross-validation
  • gmodels: This package provides functions for visualizing and summarizing model fits
  • randomForestExplainer: Used for interpreting and explaining random forest models
  • party: Utilized for fitting and visualizing recursive partitioning trees; it offers algorithms for tree-based modeling
  • randomForest: This package is employed for constructing random forest models, to improve the decision tree accuracy.

Logistic Regression

  • caret: This package is used in the process of building predictive models in R.
  • effects: This package offers functions for plotting marginal effects, interaction effects, and other model effects.
  • pROC: We can visualize and analyze receiver operating characteristic (ROC) curves, calculate area under the curve (AUC) values, and other metrics for evaluating binary classifiers.
  • textreg: This package is used for fitting regression models to textual data.
  • coefplot: Used for visualizing coefficient estimates from regression models.
  • vip: We can calculate and visualize variable importance in predictive models.
  • visreg: Utilized for visualizing the effects of predictor variables in regression models.
  • texreg: Create publication-quality tables of regression model output in R.
  • effect: Visualizes predicted values and effects of model variables from various types of regression models.

PLEASE NOTICE that this project requires the installation of certain libraries, so installing them is an essential first step. When opening this RMD file, a yellow notification bar will appear at the top of the page suggesting that you install the libraries. Please click “Install”, or run the code below.

To see the library chunks, please refer to the Library section in the RMD file.
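As a fallback to the RStudio prompt, missing packages can also be installed programmatically; a minimal sketch (the package names are taken from the lists above, and the exact set should match the library chunk in the RMD):

```r
# Install any required packages that are missing, then load them all.
pkgs <- c("ggplot2", "gridExtra", "corrplot", "stringr", "lattice", "dplyr",
          "fuzzyjoin", "httr", "magick", "C50", "rsample", "gmodels",
          "randomForestExplainer", "party", "randomForest", "caret",
          "effects", "pROC", "coefplot", "vip", "visreg", "texreg")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```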

1.4. Dataset

In this assignment, the dataset that will be used is Amazon Canada Products 2023, a dataset from the Kaggle website. Initially, the dataset consists of 11 variables and 2.1 million products collected through a web scraping process in 2023. The dataset includes unique descriptive data of quantitative and qualitative nature, which needs to be cleaned and prepared before analysis.

PLEASE NOTICE (to Mr Sander and Anggi): we have included the dataset in the same zip file as this file (around 600 MB in size). Please make sure to unzip the RMD and the csv file into the same folder. If you encounter any problem running our code, please contact Group 21 and give us a chance to solve it. We have, however, verified that this RMD file runs on several computers.

Firstly, we will import all the csv files to be used in this data analysis process. We start by loading the Amazon Canada csv file into the dataset_init data frame.

# please note that the csv needs to be put into the same folders as this file!
# the dataset should be downloaded from the Kaggle, or just unzip from our submission in BrightSpace.
# the dataset name is "dataset initial" (dataset_init)
dataset_init <- read.csv("amz_ca_total_products_data_processed.csv")

Secondly, we have manually created a product category to product department hierarchy, so that we can reduce the 266 categories to 19 departments; it is stored in long-table format in a csv file.

We import this csv file via an https link, since we store it in Dropbox; this way, readers do not have to save the csv file locally yet can still access the dataset.

We will only merge it into the main dataset in Part 2 of this notebook.

# Dropbox shareable link (make sure it ends with dl=1)
url <- "https://www.dropbox.com/scl/fi/6uhwc7u4o3swoudriayrw/categories_departments.csv?rlkey=fznvh6rs3ddtpwovdq0k0b7dg&dl=1"

# Download the data from Dropbox
response <- GET(url)
if (status_code(response) == 200) {
  # Read the content
  content_data <- content(response, type = "text", encoding = "UTF-8")
  
  # Read the CSV data, and put them into our data frame
  categories <- read.csv(text = content_data)
  head(categories)
} else {
  paste("Failed to retrieve data. Status code:", status_code(response), "!!! Please contact Group 21: m.m.fahmi@student.tudelft.nl !!! ")
}
##                        categoryName              department
## 1            3D Printing & Scanning Industrial & Scientific
## 2     Abrasive & Finishing Products Industrial & Scientific
## 3 Action Figures, Maquettes & Busts            Toys & Games
## 4                     Action Sports       Sports & Outdoors
## 5            Air Freshener Supplies      Health & Household
## 6            Arts & Crafts Supplies   Arts, Crafts & Sewing

Data dictionary

Based on Kaggle, here is the definition of each of the variables (or columns):

Field name Data type Description
asin Character Product ID from Amazon.ca
title Character Title of the product
imgURL Character URL of the product image
productURL Character URL of the product
stars Dbl Rating of the product. If no rating is available, it is represented as 0
reviews Integer Number of reviews for the product. If no reviews are available, it is represented as 0
price Dbl Current price of the product. If the price is unavailable, it is represented as 0
listPrice Dbl Original price of the product before any discounts. If no list price is available, it is represented as 0
categoryName Character Name of the product category
isBestSeller Character Indicates if the product is labeled as a best seller
boughtInLastMonth Integer Amount of product that was bought in the last month

1.5. Workflow: Machine-Learning Task Translation

The research seeks to answer which variables or factors influence whether a product becomes a best seller, and vice versa. Therefore, both a classification approach and a predictive modelling approach (logistic regression) will be used to investigate this. It is also important to create some dummy variables to obtain richer data and analysis.
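As an illustration of the dummy-variable step, the isBestSeller flag (stored as the strings "True"/"False") can be recoded into a binary indicator; a minimal sketch assuming dplyr is loaded (the new column names are our own, not from the original notebook):

```r
# Recode the character flag into a 0/1 dummy and a factor for classification
dataset_init <- dataset_init %>%
  mutate(isBestSeller_bin = as.integer(isBestSeller == "True"),
         isBestSeller_f   = factor(isBestSeller, levels = c("False", "True")))
```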

Run the code below to see the workflow used in this research process:

file_path <- "BA_workflow_Group21.png"
knitr::include_graphics(file_path)

1.6. Reflection and Limitation

In our analysis, we aim to predict the factors influencing a product’s success as a best seller on Amazon. From the dataset provided, we hypothesize that variables such as product price, total reviews, and star ratings play significant roles in determining a product’s success as a best seller. We anticipate that products with higher ratings and more reviews are likely to perform better in terms of sales.

Given the extensive size of the Amazon Canada dataset, comprising 2 million rows, we have chosen to focus solely on this dataset for our analysis. This decision aligns with our objective of predicting the likelihood of products becoming best sellers on Amazon. However, such a large dataset may also contain errors and data quality issues, underscoring the importance of thorough exploration and data preparation in our analysis.

To facilitate our analysis, we plan to segment the dataset into various categories and departments. This segmentation will enable us to classify the data effectively and derive meaningful insights. To achieve this, we have developed a department dataset based on information obtained from the Amazon website. This dataset comprises 19 departments, which we will use to categorize our main dataset accordingly. This approach will help us group similar products and analyze their performance across different departments.
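The category-to-department mapping described above can later be attached to the main dataset with a join on categoryName; a minimal sketch assuming dplyr is loaded and the categories data frame from section 1.4 (dataset_dept is a hypothetical name; the actual merge happens in Part 2):

```r
# Attach the department label to each product via its category name
dataset_dept <- dataset_init %>%
  left_join(categories, by = "categoryName")
```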


Part 2 - Data Exploration and Preparation

2.1. Data Exploration

In data exploration, we want to know the data that we want to work with. Here we can use relevant R function and data visualizations to explore the data and variables.

We utilize the colnames() function to retrieve the list of variables available in the dataset, as demonstrated below:

colnames(dataset_init)
##  [1] "asin"              "title"             "imgUrl"           
##  [4] "productURL"        "stars"             "reviews"          
##  [7] "price"             "listPrice"         "categoryName"     
## [10] "isBestSeller"      "boughtInLastMonth"

To gain an overview of the dataset, we opt for the glimpse() function instead of str() for a more concise output. Given the detailed information and large dataset, using str() yields overwhelming results. Additionally, we employ the summary() function to display basic descriptive statistics per variable, including mean, minimum, maximum, and quartiles.

print("Glimpse Output: ")
## [1] "Glimpse Output: "
glimpse(dataset_init)
## Rows: 2,165,926
## Columns: 11
## $ asin              <chr> "B07CV4L6HX", "B09N1HGY74", "B087P7538J", "B0822FF7Y…
## $ title             <chr> "Green Leaf WW3D Wonder Extension Cord Winder, Gray,…
## $ imgUrl            <chr> "https://m.media-amazon.com/images/I/81cRe0AVC4L._AC…
## $ productURL        <chr> "https://www.amazon.ca/dp/B07CV4L6HX", "https://www.…
## $ stars             <dbl> 4.4, 3.8, 4.0, 4.5, 4.2, 4.5, 4.3, 4.0, 4.5, 4.4, 4.…
## $ reviews           <int> 2876, 55, 126, 1936, 46, 2505, 216, 53, 164, 366, 87…
## $ price             <dbl> 47.69, 10.99, 25.99, 21.99, 18.99, 15.99, 27.99, 9.9…
## $ listPrice         <dbl> 0.00, 0.00, 27.99, 30.99, 0.00, 0.00, 0.00, 0.00, 0.…
## $ categoryName      <chr> "Industrial  Scientific", "Industrial  Scientific", …
## $ isBestSeller      <chr> "False", "False", "False", "False", "False", "False"…
## $ boughtInLastMonth <int> 0, 100, 50, 100, 100, 0, 50, 50, 0, 50, 50, 100, 50,…
print("Summary Output: ")
## [1] "Summary Output: "
summary(dataset_init)
##      asin              title              imgUrl           productURL       
##  Length:2165926     Length:2165926     Length:2165926     Length:2165926    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      stars          reviews             price            listPrice       
##  Min.   :0.000   Min.   :     0.0   Min.   :    0.00   Min.   :   0.000  
##  1st Qu.:0.000   1st Qu.:     0.0   1st Qu.:   15.42   1st Qu.:   0.000  
##  Median :4.000   Median :     5.0   Median :   27.42   Median :   0.000  
##  Mean   :2.624   Mean   :   545.7   Mean   :  111.22   Mean   :   4.651  
##  3rd Qu.:4.500   3rd Qu.:   123.0   3rd Qu.:   57.50   3rd Qu.:   0.000  
##  Max.   :5.000   Max.   :868865.0   Max.   :40900.00   Max.   : 999.990  
##  categoryName       isBestSeller       boughtInLastMonth  
##  Length:2165926     Length:2165926     Min.   :    0.000  
##  Class :character   Class :character   1st Qu.:    0.000  
##  Mode  :character   Mode  :character   Median :    0.000  
##                                        Mean   :    9.005  
##                                        3rd Qu.:    0.000  
##                                        Max.   :20000.000

We can see that we have a huge dataset, containing more than 2 million rows! It was collected through a web scraping process in 2023 and provides valuable insights into product titles, pricing, ratings, and more for products on Amazon Canada, as shown by the colnames() and summary output above.

Exploring numerical variables

Based on the summary, we have 5 variables with a numerical data type; each is explored in this section. First, we compare the mean and standard deviation of the numeric variables to assess the magnitude of outliers in each.
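The mean-versus-standard-deviation comparison described above can also be computed in one pass instead of per variable; a sketch:

```r
# Mean and standard deviation for every numeric variable, side by side
num_vars <- c("stars", "reviews", "price", "listPrice", "boughtInLastMonth")
round(sapply(dataset_init[num_vars], function(x) c(mean = mean(x), sd = sd(x))), 2)
```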

Variable: Stars

summary(dataset_init$stars)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   4.000   2.624   4.500   5.000
paste("St Dev: ",sd(dataset_init$stars))
## [1] "St Dev:  2.14990454724761"

Variable: Reviews

summary(dataset_init$reviews)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      5.0    545.7    123.0 868865.0
paste("St Dev: ", sd(dataset_init$reviews))
## [1] "St Dev:  4355.22473914881"

Variable: Price

summary(dataset_init$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    15.42    27.42   111.22    57.50 40900.00
paste("St Dev: ", sd(dataset_init$price))
## [1] "St Dev:  497.665280323457"

Variable: ListPrice

summary(dataset_init$listPrice)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    4.651    0.000  999.990
paste("St Dev: ", sd(dataset_init$listPrice))
## [1] "St Dev:  29.8439227961412"

Variable: Bought in Last Month

summary(dataset_init$boughtInLastMonth)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.000     0.000     9.005     0.000 20000.000
paste("St Dev: ", sd(dataset_init$boughtInLastMonth))
## [1] "St Dev:  98.3826458869313"

Overall, by comparing the standard deviation and mean of each variable, we observe skewness in their distributions, which indicates potential biases or irregularities in the data. While the variability in each variable provides valuable insights into product performance and market dynamics, we want to confirm the skewness visually so that we can deal with the potential lack of data reliability later on.

Now, we want to see the distribution of those variables using histogram.

# Create faceted histograms
columns <- c("stars", "reviews", "price", "listPrice", "boughtInLastMonth")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see significant skewness in some columns, so we look at the data that is skewing each variable. Based on the results and visualization above, we can conclude that:

  1. Stars:

    • The distribution of star ratings appears to be negatively skewed, as evidenced by the mean (2.624) being lower than the median (4.000).
    • The wide range of star ratings (from 0 to 5) suggests that there is variability in customer satisfaction levels, but the skewness indicates that there may be more lower ratings than higher ones.
    • However, the reliability of the star ratings may be questionable, as the standard deviation (2.150) indicates considerable variability in ratings across products.
  2. Reviews:

    • The distribution of reviews is highly right-skewed, with a median of 5 and a mean of 545.7, suggesting that a few products receive a disproportionately high number of reviews.
    • The wide range of review counts (from 0 to 868865) indicates significant variability in customer feedback, but the skewness suggests that most products receive relatively few reviews.
    • The reliability of review counts may be affected by outliers or highly popular products that skew the distribution.
  3. Price:

    • The distribution of prices appears to be positively skewed, with a median of 27.42 and a mean of 111.22, indicating that there may be more lower-priced products than higher-priced ones.
    • The wide range of prices (from 0 to 40900.00) suggests considerable variability in pricing strategies among products, but the skewness indicates that most products are priced relatively low.
    • However, the reliability of price data may be influenced by outliers or extreme values.
  4. List Price:

    • Similar to the price variable, the distribution of list prices is positively skewed, with a median of 0.000 and a mean of 4.651, suggesting that most products have relatively low listed prices.
    • The wide range of list prices (from 0 to 999.990) indicates variability in pricing strategies, but the skewness suggests that lower list prices are more common.
    • As with price, the reliability of list price data may be impacted by outliers or inconsistencies in pricing information.
  5. Bought in Last Month:

    • The distribution of purchases made in the last month also appears to be positively skewed, with a median of 0.000 and a mean of 9.005, indicating that most products have relatively low recent sales volumes.
    • The wide range of purchase counts (from 0 to 20000.000) suggests variability in product popularity, but the skewness suggests that fewer products have high recent sales.
    • The reliability of purchase data may be influenced by outliers or seasonal trends affecting sales volumes.

To preserve the integrity of the dataset, we chose to keep these data but apply a log transformation later, in section 2.5.
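The log transformation deferred to section 2.5 can be sketched with log1p(), which maps 0 to 0 and so copes with the many zero values noted above; an illustrative sketch (dataset_log and the _log column names are our own, not from the original notebook):

```r
# log1p(x) = log(1 + x): keeps zeros finite while compressing the long right tail
dataset_log <- dataset_init
for (col in c("reviews", "price", "listPrice", "boughtInLastMonth")) {
  dataset_log[[paste0(col, "_log")]] <- log1p(dataset_log[[col]])
}
```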

# Create faceted boxplots
columns <- c("stars", "reviews", "price", "listPrice", "boughtInLastMonth")

# Initialize a list to store plots
plots <- list()

# Create boxplots for each variable
for (col in columns) {
  p <- ggplot(dataset_init, aes(y = .data[[col]])) +
    geom_boxplot() +
    ggtitle(paste("Boxplot for", col)) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm"))
    )
  
  plots[[length(plots) + 1]] <- p
}

# Arrange the plots in a 2 by 3 grid
grid.arrange(grobs = plots, ncol = 3, padding = unit(5, "lines"))

To explore the outliers in each variable, we use boxplots. They show that:

  1. Stars
  • The box is very elongated, showing that most items have similar ratings between 4 and 5. The lower and upper whiskers are barely visible, suggesting little variability in the data, and the boxplot shows no dots that would indicate outliers.
  2. Reviews
  • The box appears as a line at the bottom of the plot, suggesting a very low median, 1st quartile, and 3rd quartile. There are review counts much higher than the rest that do not fit the general pattern of the data (outliers), showing that while most items have very few reviews, a few items have a very high number of reviews. A distribution with many low values and a few very high values could mean that product popularity varies greatly.
  3. Price
  • The box again appears as a line at the bottom of the plot, suggesting a very low median, 1st quartile, and 3rd quartile. The outliers indicate prices significantly higher than the rest of the data, well above the “normal” price range. Overall, this boxplot suggests that while most items are priced within a relatively narrow range at the lower end of the scale, several items are priced much higher, possibly luxury items or those with premium pricing. This is a common pattern in many markets where a few items are much more expensive than the majority.
  4. List Price
  • Similar to the previous plots, the box appears as a line at the bottom. The upper whisker extends significantly upwards, meaning there are list prices higher than the median and the quartiles but not high enough to be considered outliers. The absence of individual dots implies that there are no outliers at the high end of the list prices. The concentration of prices at the lower end might indicate standard pricing for common products, while the higher prices could belong to more premium or less common items within the same category.
  5. Bought In Last Month
  • Similar to the previous plots, the box appears as a line at the bottom. The plot shows only the upper whisker extending upwards, indicating that some items were bought more frequently than most others, but still not in extremely high quantities. The outliers in this boxplot indicate items bought significantly more often than the rest. This distribution suggests that most items have low sales figures, with a few exceptions that sell a lot more; these could be a few popular items, or ones on promotion, that drive more sales than items that sell less frequently.

Next, we identify zero values. Specifically, we calculate the proportion of zero values in each numerical variable. Identifying them matters for later steps, because values of zero (or NA) cannot be used directly in some calculations.

zero_percentages <- sapply(dataset_init[, c("stars", "reviews", "price", "listPrice", "boughtInLastMonth")], function(x) {
  sum(x == 0, na.rm = TRUE) / length(x) * 100
})
zero_percentages
##             stars           reviews             price         listPrice 
##         39.129407         39.088224          8.214039         92.305877 
## boughtInLastMonth 
##         94.949273

Based on the zero percentages calculated for each variable above:

  1. Stars: Approximately 39.13% of the observations have a value of zero in the “stars” variable. This could indicate that a significant portion of products in this dataset may not have received any star ratings, possibly suggesting new or less popular products.

  2. Reviews: Similarly, around 39.09% of the observations have zero reviews. This suggests that a large portion of products in this dataset may not have received any reviews, indicating either new products or products with low customer engagement.

  3. Price: About 8.21% of the observations in this dataset have a price value of zero. This could be indicative of products with missing or incomplete price information, or it could represent products that are offered for free.

  4. List Price: A substantial portion in this dataset, approximately 92.31%, of the observations have a list price of zero. This might imply that many products do not have a listed price, which could be due to various reasons such as promotional items, bundled products, or missing data.

  5. Bought in Last Month: The variable “boughtInLastMonth” has the highest percentage of zeros, with around 94.95% of observations in this dataset having a value of zero. This suggests that the majority of products may not have been purchased in the last month, indicating potentially low sales or limited recent activity.

The analysis of zero percentages highlights potential areas of interest and further investigation. It shows the importance of understanding the distribution and prevalence of zero values in the dataset to draw meaningful insights and make informed decisions.

generate_pie_chart <- function(zeros_condition, title, dataset) {
    if (missing(dataset)) {
        stop("Dataset argument is missing.")
    }
    # Calculate the percentage of products with conditions met
    percentage_zeros <- round(sum(zeros_condition) / nrow(dataset) * 100, 2)
    # Calculate the percentage of products where conditions are not met
    percentage_non_zeros <- round(100 - percentage_zeros, 2)
    # Create a data frame for the pie chart
    pie_data <- data.frame(Category = c("Zeros", "Non-Zeros"), Percentage = c(percentage_zeros, percentage_non_zeros))
    # Create the pie chart
    pie(pie_data$Percentage, labels = paste(pie_data$Category, ": ", pie_data$Percentage, "%"),
        main = title, cex.main = 0.8)  # Adjust the font size as needed
    # Return pie_data for later use if needed
    return(pie_data)
}

# Generate pie chart for products with both stars and reviews equal to 0
pie_data_stars <- generate_pie_chart(dataset_init$stars == 0 & dataset_init$reviews == 0, 
                    "% of products with both stars and reviews equal to 0", dataset_init)

# Generate pie chart for products with reviews, price, listPrice and boughtInLastMonth all equal to 0
pie_data_no_stars <- generate_pie_chart(dataset_init$reviews == 0 & dataset_init$price == 0 & dataset_init$listPrice == 0 & dataset_init$boughtInLastMonth == 0, 
                    "% of products with reviews, price, listPrice, boughtInLastMonth all equal to 0", dataset_init)

Observing the pie charts above, we suspect that the products with zero reviews may also have values of '0' in the price, listPrice, and boughtInLastMonth variables. So we extract the rows where these variables contain '0', bind them into subsets, and look at the overlap between the subsets.

# First, we need to subset on those variables
subset_stars_zero <- which(dataset_init$stars == 0)
subset_reviews_zero <- which(dataset_init$reviews == 0)
subset_price_zero <- which(dataset_init$price == 0)
subset_listPrice_zero <- which(dataset_init$listPrice == 0)
subset_boughtInLastMonth_zero <- which(dataset_init$boughtInLastMonth == 0)

# Combine the subsets and count the unique affected rows (their union)
duplicated_rows <- unique(c(subset_reviews_zero, subset_price_zero, subset_listPrice_zero, subset_boughtInLastMonth_zero))
paste("Number of products with a value of 0: ",length(duplicated_rows))
## [1] "Number of products with a value of 0:  2142926"
paste("Equal to Percentage: ",length(duplicated_rows)/nrow(dataset_init) * 100)
## [1] "Equal to Percentage:  98.9380985315288"

This confirms that 2,142,926 products (about 98.94%) have a value of 0 in at least one of the reviews, price, listPrice, or boughtInLastMonth columns. To make sure that our data analysis focuses on products with recent sales history and recent activity, we will remove these products without signals of activity from the dataset during the Data Cleaning phase.
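As a quick illustration on a toy data frame (hypothetical values, not the real dataset), unique(c(...)) applied to the which() subsets yields the union of the affected row indices:

```r
# Toy data frame illustrating the union of zero-value row indices
toy <- data.frame(
  reviews           = c(0, 5, 0, 3),
  price             = c(0, 0, 10, 2),
  boughtInLastMonth = c(0, 0, 0, 7)
)

idx_reviews <- which(toy$reviews == 0)            # rows 1, 3
idx_price   <- which(toy$price == 0)              # rows 1, 2
idx_bought  <- which(toy$boughtInLastMonth == 0)  # rows 1, 2, 3

# unique(c(...)) collapses the overlapping subsets into their union
affected <- unique(c(idx_reviews, idx_price, idx_bought))
length(affected)  # 3 of the 4 toy rows have a zero somewhere
```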

Exploring non-numerical variables

From the data frame explained above, we have 6 non-numerical (categorical) variables. Those are asin, title, imgUrl, productURL, categoryName, and isBestSeller.

First, we will look at the count and percentage of each group of the isBestSeller variable.

# Count the unique values inside the "isBestSeller" column
isBestSeller_Count <- table(dataset_init$isBestSeller)

# Show in percentages, for easier interpretation
names(isBestSeller_Count) <- c("not best seller(%)", "best seller(%)")
( isBestSeller_Count/nrow(dataset_init)) * 100
## not best seller(%)     best seller(%) 
##         99.6466177          0.3533823

Next, we group products by category, calculate the count and percentage of best sellers in each category, sort them in descending order of percentage, and output the updated data frame. This helps analyze which product categories have the highest percentage of best sellers. We can see that the 'Home & Kitchen', 'Tools & Home Improvement', 'Sports & Outdoors', 'Clothing, Shoes & Jewellery', and 'Industrial & Scientific' categories have the highest percentage of products with a Best Seller ranking.

# Group by categoryName and calculate the count of best sellers
category <- dataset_init %>%
  group_by(categoryName) %>%
  summarise(isBestSeller_Count = sum(isBestSeller == "True"),
            Product_Count = n(),
            Percentage_BestSeller = round((isBestSeller_Count/Product_Count)*100, 2))
category <- category[order(category$Percentage_BestSeller, decreasing = TRUE), ]

# Output the updated data frame
category
## # A tibble: 266 × 4
##    categoryName           isBestSeller_Count Product_Count Percentage_BestSeller
##    <chr>                               <int>         <int>                 <dbl>
##  1 Home  Kitchen                         426          3519                 12.1 
##  2 Tools  Home Improveme…                167          3021                  5.53
##  3 Sports  Outdoors                      351          6716                  5.23
##  4 Clothing, Shoes  Jewe…                260          5123                  5.08
##  5 Industrial  Scientific                227          5330                  4.26
##  6 Automotive                            125          3421                  3.65
##  7 Health  Personal Care                  72          2285                  3.15
##  8 Toys  Games                           176          5604                  3.14
##  9 Electronics                           459         16784                  2.73
## 10 Patio, Lawn  Garden                   239          9200                  2.6 
## # ℹ 256 more rows

The following horizontal bar plot compares the product count and best-seller count across the top 10 product categories with the highest percentage of best-seller products. It visualizes the distribution of products and their best-selling status, helping identify categories with high product counts and their corresponding best-seller counts for strategic decision-making. It could indicate that these are the categories more likely to achieve a best-seller ranking.

# Select top 10 categories with highest Percentage_BestSeller
top_categories <- head(category, 10)
par(mar = c(5, 15, 4, 4) + 0.5)

# Plot Product_Count
barplot(height = top_categories$Product_Count, names.arg = top_categories$categoryName, 
        ylab = "", xlab = "Product Count",
        main = "Product Count by Category in Top 10 Categories",
        col = "skyblue",  
        horiz = TRUE, 
        las = 1,
        cex.axis = 0.8,  # Adjust size of x-axis labels        
        cex.names = 0.8, # Adjust size of y-axis labels
        cex.lab = 1,    # Adjust size of axis titles
        xlim = c(0, max(top_categories$Product_Count) * 1.2), # Set xlim to extend for BestSeller_Count bars
        add = FALSE)  # Ensure it's not adding to existing plot

# Add bars for BestSeller_Count
barplot(height = top_categories$isBestSeller_Count, names.arg = rep("", nrow(top_categories)), 
        col = "orange",  
        horiz = TRUE,
        add = TRUE,
        axes = FALSE)  # Turn off axes

# Add legend and axis title
legend("bottomright", legend = c("Product Count", "Best Seller Count"), fill = c("skyblue", "orange"), 
       bty = "n",  # No box around legend
       inset = 0.05)  # Adjust legend position
mtext(text = "Category Name", side = 2, line = 9, cex = 1)

Exploring association between variables

We want to further explore the associations between different pairs of variables. We omitted categoryName from this exploration because it contains 266 categorical values; we will explore it once we have reduced it to department later in this process.

# Scatter plot for (stars, price); plot() drops incomplete pairs itself,
# so we avoid calling na.omit() on each vector separately, which could
# misalign the x and y values
plot(dataset_init$stars, dataset_init$price,
     xlab = "Stars",
     ylab = "Price",
     main = "Association between Stars and Price",
     pch = 16 # Use solid circles as points
)

# Scatter plot for (stars, reviews)
plot(dataset_init$stars, dataset_init$reviews,
     xlab = "Stars",
     ylab = "Reviews",
     main = "Association between Stars and Reviews",
     pch = 16 # Use solid circles as points
)

# Create a contingency table ( isBestSeller, boughtInLastMonth)
table(dataset_init$boughtInLastMonth, dataset_init$isBestSeller)
##        
##           False    True
##   0     2053515    3016
##   50      51790     935
##   100     28466    1009
##   200      9740     577
##   300      4582     362
##   400      2728     277
##   500      1739     192
##   600      1186     151
##   700       790     122
##   800       708     105
##   900       479      86
##   1000     1879     474
##   2000      394     143
##   3000      147      80
##   4000       56      41
##   5000       28      27
##   6000       13      17
##   7000       12       6
##   8000        9       7
##   9000        2       7
##   10000       8      18
##   20000       1       2

The scatter plots above suggest a positive association between star ratings and review counts: products with more reviews tend to have higher star ratings, while prices appear relatively evenly distributed across star ratings. This suggests that higher-rated products garner more attention.

The contingency table above also displays the count of products categorized by their isBestSeller status and the number of purchases (boughtInLastMonth). It provides insights into the distribution of products across various purchase levels within each isBestSeller category, aiding in understanding their association. The contingency table reveals that a majority of products are not best sellers, with over 2 million products having zero purchases. Additionally, few products achieve high purchase levels, indicating a concentration of sales among a small portion of the product pool.

2.2. Data Cleaning and Preparation

Dealing with outliers

The earlier data exploration steps confirmed that the products with zeros in the numerical variables overlap significantly with each other. Since we want to base our analysis on product data with recent sales activity, we will remove these affected rows.

# Remove rows where listPrice is not available
dataset_init <- dataset_init[dataset_init$listPrice != 0.00, ] 

# Sort by listPrice to confirm that products with $0.00 in the listPrice column are removed
dataset_init <- dataset_init[order(dataset_init$listPrice, decreasing = TRUE), ] 

# Remove rows where no sales were made last month
paste("Number of Rows that will be deleted, because no sales were made last month",sum(dataset_init$boughtInLastMonth == 0))
## [1] "Number of Rows that will be deleted, because no sales were made last month 143556"
dataset_init <- dataset_init[dataset_init$boughtInLastMonth != 0.00, ] 

# Remove products that did not receive any reviews
dataset_init <- dataset_init[dataset_init$reviews != 0.00, ] 

paste("After cleaning, the dataset contains rows: ", nrow(dataset_init))
## [1] "After cleaning, the dataset contains rows:  23000"

Dealing with NA and inconsistent data

To check for NA values, we use is.na().

sum(is.na(dataset_init))
## [1] 0

It is clear that there are no NA values; the "0" values were already dealt with above.

Next, we want to explore approaches that can make these category names more useful. Thus, we move to the next step: finding hidden delimiters in the categoryName variable. This is important to keep the names consistent, and because classification needs clear category names.

# Find the indices of entries in categoryName with whitespace
indices_with_whitespace <- grep("\\s", dataset_init$categoryName)
categories_with_whitespace <- dataset_init[indices_with_whitespace, ]
length(unique(categories_with_whitespace$categoryName))
## [1] 162
sort(unique(categories_with_whitespace$categoryName))
##   [1] "3D Printing  Scanning"                   
##   [2] "Action Figures, Maquettes  Busts"        
##   [3] "Action Sports"                           
##   [4] "Arts  Crafts Supplies"                   
##   [5] "Automotive Care"                         
##   [6] "Automotive Exterior Accessories"         
##   [7] "Automotive Interior Accessories"         
##   [8] "Automotive Replacement Parts"            
##   [9] "Automotive Tires  Wheels"                
##  [10] "Automotive Tools  Equipment"             
##  [11] "Baby  Child Care Products"               
##  [12] "Baby  Toddler Toys"                      
##  [13] "Baby Strollers"                          
##  [14] "Bath  Body"                              
##  [15] "Bath Products"                           
##  [16] "Beauty Tools  Accessories"               
##  [17] "Bikes, Scooters  Ride-Ons"               
##  [18] "Boating  Watersports"                    
##  [19] "Breakfast Cereal"                        
##  [20] "Building  Construction Toys"             
##  [21] "Building Supplies"                       
##  [22] "Camera  Photo"                           
##  [23] "Camping  Hiking Equipment"               
##  [24] "Car Electronics  Accessories"            
##  [25] "Cat Supplies"                            
##  [26] "Child Safety Car Seats"                  
##  [27] "Clothing, Shoes  Jewellery"              
##  [28] "Coffee, Tea  Espresso"                   
##  [29] "Collectible Toys"                        
##  [30] "Computer Accessories"                    
##  [31] "Computer Audio  Video Accessories"       
##  [32] "Computer Components"                     
##  [33] "Cutting Tools"                           
##  [34] "Cycling Equipment"                       
##  [35] "Diet  Nutrition Products"                
##  [36] "Dishwashing Supplies"                    
##  [37] "Dog Supplies"                            
##  [38] "Dolls  Accessories"                      
##  [39] "Electrical Equipment"                    
##  [40] "Exercise  Fitness Equipment"             
##  [41] "Farming  Urban Agriculture"              
##  [42] "Food Service Equipment  Supplies"        
##  [43] "Fresh Flowers  Indoor Plants"            
##  [44] "Game Hardware"                           
##  [45] "Games  Accessories"                      
##  [46] "Garden Structures  Germination Equipment"
##  [47] "Golf Equipment"                          
##  [48] "Hair Care"                               
##  [49] "Hand Tools"                              
##  [50] "Handmade Home Décor"                     
##  [51] "Health  Personal Care"                   
##  [52] "Health Care Products"                    
##  [53] "Heating Cooling  Air Quality"            
##  [54] "Home  Kitchen"                           
##  [55] "Home  Portable Audio"                    
##  [56] "Home Brewing  Wine Making"               
##  [57] "Home Décor"                              
##  [58] "Home Storage  Organization"              
##  [59] "Home Textiles"                           
##  [60] "Household Batteries"                     
##  [61] "Household Cleaning"                      
##  [62] "Household Cleaning Tools"                
##  [63] "Household Supplies"                      
##  [64] "Hunting  Fishing"                        
##  [65] "Hydraulics, Pneumatics  Plumbing"        
##  [66] "Industrial  Scientific"                  
##  [67] "Industrial Materials"                    
##  [68] "International Food Market"               
##  [69] "Irons, Steamers  Accessories"            
##  [70] "Janitorial  Sanitation Supplies"         
##  [71] "Kids' Play Tents  Tunnels"               
##  [72] "Kitchen  Bath Fixtures"                  
##  [73] "Kitchen  Dining"                         
##  [74] "Kitchen Cookware"                        
##  [75] "Kitchen Knives  Cutlery Accessories"     
##  [76] "Kitchen Storage  Organization"           
##  [77] "Kitchen Utensils  Gadgets"               
##  [78] "Lab  Scientific Products"                
##  [79] "Laptop  Netbook Computer Accessories"    
##  [80] "Large Appliances"                        
##  [81] "Laundry Supplies"                        
##  [82] "Leisure Sports  Game Room"               
##  [83] "Luggage  Travel Gear"                    
##  [84] "Martial Arts  Combat Sports"             
##  [85] "Material Handling"                       
##  [86] "Material Transport Equipment"            
##  [87] "Medical Supplies  Equipment"             
##  [88] "Men's Accessories"                       
##  [89] "Men's Clothing"                          
##  [90] "Men's Jewelry"                           
##  [91] "Men's Shoes"                             
##  [92] "Men's Watches"                           
##  [93] "Motorcycle Accessories  Parts"           
##  [94] "Musical Instruments, Stage  Studio"      
##  [95] "Nail Polish  Nail Decoration Products"   
##  [96] "Nails, Screws  Fasteners"                
##  [97] "Novelty  Special Use Clothing"           
##  [98] "Nursery Furniture, Bedding  Décor"       
##  [99] "Occupational Health  Safety Products"    
## [100] "Office Products"                         
## [101] "Oils  Fluids"                            
## [102] "Oral Hygiene Products"                   
## [103] "Outdoor Cooking"                         
## [104] "Outdoor Décor"                           
## [105] "Outdoor Gear"                            
## [106] "Outdoor Lighting Products"               
## [107] "Outdoor Play Toys"                       
## [108] "Outdoor Power  Lawn Equipment"           
## [109] "Outdoor Recreation Apparel  Equipment"   
## [110] "Paint, Body  Trim Products"              
## [111] "Paper  Plastic Household Supplies"       
## [112] "Patio Furniture  Accessories"            
## [113] "Patio, Lawn  Garden"                     
## [114] "Perfume  Cologne"                        
## [115] "Pet Supplies"                            
## [116] "Plants Seeds  Bulbs"                     
## [117] "Pools, Hot Tubs  Supplies"               
## [118] "Power  Hand Tools"                       
## [119] "Power Tools  Hand Tools"                 
## [120] "Printer Accessories"                     
## [121] "Professional Medical Supplies"           
## [122] "RV Parts  Accessories"                   
## [123] "Salon  Spa Equipment"                    
## [124] "Sandboxes  Beach Toys"                   
## [125] "Science Education Supplies"              
## [126] "Sewing, Craft  Hobby"                    
## [127] "Sex  Sensuality Products"                
## [128] "Shaving  Hair Removal Products"          
## [129] "Shoe, Jewelry  Watch Accessories"        
## [130] "Skin Care Products"                      
## [131] "Small Appliances"                        
## [132] "Snow  Ice Sports"                        
## [133] "Sport Specific Clothing"                 
## [134] "Sporting Apparel"                        
## [135] "Sports  Outdoors"                        
## [136] "Sports Fan Shop"                         
## [137] "Stationery  Party Supplies"              
## [138] "Stuffed  Plush Animals"                  
## [139] "Swimming Pool  Outdoor Water Toys"       
## [140] "Tarps  Tie-Downs"                        
## [141] "Team Sports"                             
## [142] "Televisions  Video"                      
## [143] "Test, Measure  Inspect"                  
## [144] "Tools  Home Improvement"                 
## [145] "Toy Sports Equipment"                    
## [146] "Toy Vehicles"                            
## [147] "Toys  Games"                             
## [148] "Uniforms, Work  Safety"                  
## [149] "USB Gadgets"                             
## [150] "Vacuums  Floor Care"                     
## [151] "Vehicle Electronics"                     
## [152] "Vision Care Products"                    
## [153] "Vitamins, Minerals  Supplements"         
## [154] "Water Coolers, Filters  Cartridges"      
## [155] "Weather Thermometers"                    
## [156] "Women's Accessories"                     
## [157] "Women's Clothing"                        
## [158] "Women's Handbags"                        
## [159] "Women's Health  Family Planning"         
## [160] "Women's Jewelry"                         
## [161] "Women's Shoes"                           
## [162] "Women's Watches"

We noticed double spaces ("  ") in 162 category names. We then tried two approaches:

  1. We explored treating the double space as a delimiter that splits the string into a top category and a subcategory. The outcome was not meaningful.

  2. After manually comparing with category names on the Amazon website, we found that the names containing double spaces are missing an "&". All double spaces are therefore replaced with " & ".

dataset_init$categoryName <- gsub("  ", " & ", dataset_init$categoryName)
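As a sanity check, the same gsub() call can be tried on a few sample strings (illustrative examples, not drawn from the dataset) to confirm no double spaces remain:

```r
# Sample category names with the double-space pattern (hypothetical examples)
samples <- c("Home  Kitchen", "Tools  Home Improvement", "Sports  Outdoors")

# Replace the double space with " & ", as done on dataset_init$categoryName
fixed <- gsub("  ", " & ", samples)
fixed
# Confirm no double spaces survive the replacement
length(grep("  ", fixed))  # expected: 0
```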

2.3. Feature Engineering

Feature engineering is a crucial step in the process of building machine learning models that will be done in this research. It involves creating new features or modifying existing features to improve the performance of a model.

Encoding non-numeric variables

We start by encoding non-numerical variables. The ‘isBestSeller’ values are encoded from ‘True’ to ‘1’ and ‘False’ to ‘0’.

# isBestSeller --> numeric values
dataset_init$isBestSeller <- ifelse(dataset_init$isBestSeller == "True", 1, 0)

We then check the dimensionality details of the dataset. Note the change in the isBestSeller column after the code above.

glimpse(dataset_init)
## Rows: 23,000
## Columns: 11
## $ asin              <chr> "B08J3Y9SYR", "B00CH9QWOU", "B07WXZDHGV", "B0183D35S…
## $ title             <chr> "GOTRAX EBE3 27.5inch Electric Bike with 48V 10Ah Re…
## $ imgUrl            <chr> "https://m.media-amazon.com/images/I/71eamY5Gp+L._AC…
## $ productURL        <chr> "https://www.amazon.ca/dp/B08J3Y9SYR", "https://www.…
## $ stars             <dbl> 4.3, 4.5, 4.3, 4.4, 3.9, 4.1, 4.6, 4.5, 4.6, 4.1, 4.…
## $ reviews           <int> 349, 22226, 3718, 189, 210, 3184, 4142, 7651, 269, 2…
## $ price             <dbl> 849.97, 799.97, 740.98, 397.03, 799.99, 666.98, 699.…
## $ listPrice         <dbl> 999.99, 999.99, 999.00, 899.99, 899.99, 899.99, 899.…
## $ categoryName      <chr> "Cycling Equipment", "Home & Kitchen", "Automotive T…
## $ isBestSeller      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ boughtInLastMonth <int> 50, 400, 400, 200, 50, 50, 200, 50, 100, 3000, 50, 2…

Deriving new variables

We made a mapping file based on Amazon's department/category hierarchy. We now add another column, "department", by merging with this categories data frame. We then want to see the number of products in each department and each category. The horizontal bar plots show that 'Beauty', 'Grocery & Gourmet Food', 'Home & Kitchen', 'Baby Products', and 'Electronics' contain the largest numbers of categories and products. This indicates that these five are the most popular departments for vendors, with a significantly higher level of competition, which could be driven by the profitability potential in these segments.

Why is this necessary? Department is a broader categorization of products on Amazon, while the "categories" from the initial Kaggle dataset are too granular to yield insights on their own. Furthermore, we are positioning ourselves as consultants answering the business question of where to sell — specifically, which department to focus on. Thus, it is wiser to start from the broader view.
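The merge itself is not echoed here; a minimal sketch of the idea, assuming a hypothetical two-column mapping data frame (category_mapping, with illustrative rows only), could look like:

```r
# Hypothetical category-to-department mapping (illustrative rows, not the real file)
category_mapping <- data.frame(
  categoryName = c("Cycling Equipment", "Home & Kitchen", "Kitchen Cookware"),
  department   = c("Sports & Outdoors", "Home & Kitchen", "Home & Kitchen")
)

# Toy product rows standing in for dataset_init (hypothetical ASINs)
products <- data.frame(
  asin         = c("B000000001", "B000000002"),
  categoryName = c("Cycling Equipment", "Kitchen Cookware")
)

# A left join keeps every product and attaches its broader department
products <- merge(products, category_mapping, by = "categoryName", all.x = TRUE)
products
```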

We also create a new variable called hasDiscount_dummy, which encodes the discount status of every product sold on Amazon Canada. With this variable, we can categorize products into those with and without a discount.

# Create a new column "hasDiscount_dummy":
# 1 if the list price exceeds the selling price (discounted), 0 otherwise.
# (Rows with listPrice == 0 were already removed during data cleaning,
# so assigning 0 to the remaining non-discounted rows avoids leftover NAs.)
dataset_init$hasDiscount_dummy <- ifelse(dataset_init$listPrice > dataset_init$price, 1, 0)
head(dataset_init)
##               asin
## 651901  B08J3Y9SYR
## 727943  B00CH9QWOU
## 481491  B07WXZDHGV
## 729325  B0183D35S4
## 1324647 B0B2K6PMPK
## 1804174 B0BSFW26YK
##                                                                                                                                                                                                          title
## 651901  GOTRAX EBE3 27.5inch Electric Bike with 48V 10Ah Removable Lithium-Ion Battery, 500W Powerful Motor up Speed 32km/h, Shimano Professional 21 Speed Gears,Dual Disc Brakes Alloy Frame Electric Bicycle
## 727943                                                                                                                            Breville Barista Express Espresso Machine, Brushed Stainless Steel, BES870XL
## 481491                                                    ChargePoint Home Flex Level 2 WiFi Enabled 240 Volt NEMA 6-50 Plug Electric Vehicle EV Charger for Plug in or Hardwired Indoor Outdoor Setup w/Cable
## 729325                                                                                                                         CUISINART MCP-12NCC MultiClad Pro Stainless Steel 12-Piece Cookware Set, Silver
## 1324647                               EverCross EV10K PRO App-Enabled Electric Scooter, Scooter Adults with 500W Motor, Up to 19 MPH & 22 Miles E-Scooter, Lightweight Folding for 10'' Honeycomb Tires, Black
## 1804174             ECOVACS Deebot N10+ Robot Vacuum and Mop Cleaner, Self Emptying Robotic Vacuum, 3800Pa Suction, Laser Based LiDAR Navigation, Carpet Detection, Multi Floor Mapping, Personalized Cleaning
##                                                                 imgUrl
## 651901  https://m.media-amazon.com/images/I/71eamY5Gp+L._AC_UL320_.jpg
## 727943  https://m.media-amazon.com/images/I/71BvCt6eAFL._AC_UY218_.jpg
## 481491  https://m.media-amazon.com/images/I/51kxX+QE+-L._AC_UL320_.jpg
## 729325  https://m.media-amazon.com/images/I/81RRGy2PHhL._AC_UY218_.jpg
## 1324647 https://m.media-amazon.com/images/I/61HkiF3RUDL._AC_UL320_.jpg
## 1804174 https://m.media-amazon.com/images/I/61SjFdho74L._AC_UY218_.jpg
##                                  productURL stars reviews  price listPrice
## 651901  https://www.amazon.ca/dp/B08J3Y9SYR   4.3     349 849.97    999.99
## 727943  https://www.amazon.ca/dp/B00CH9QWOU   4.5   22226 799.97    999.99
## 481491  https://www.amazon.ca/dp/B07WXZDHGV   4.3    3718 740.98    999.00
## 729325  https://www.amazon.ca/dp/B0183D35S4   4.4     189 397.03    899.99
## 1324647 https://www.amazon.ca/dp/B0B2K6PMPK   3.9     210 799.99    899.99
## 1804174 https://www.amazon.ca/dp/B0BSFW26YK   4.1    3184 666.98    899.99
##                         categoryName isBestSeller boughtInLastMonth
## 651901             Cycling Equipment            0                50
## 727943                Home & Kitchen            1               400
## 481491  Automotive Tools & Equipment            0               400
## 729325                Home & Kitchen            0               200
## 1324647                 Toys & Games            0                50
## 1804174         Vacuums & Floor Care            0                50
##         hasDiscount_dummy
## 651901                  1
## 727943                  1
## 481491                  1
## 729325                  1
## 1324647                 1
## 1804174                 1

Since this dataset contains a limited number of variables, we extract further insights by creating new variables that track the discount amount, discount percentage, product title length, and revenue.

  • Revenue is obtained by multiplying Price x BoughtInLastMonth
  • Discount is simply the difference between the list price (the original price) and the current price
  • Title Length: the title is the first property a user looks at, so we hypothesize that a more concise title is more eye-catching. [Based on this reference: https://www.convertcart.com/blog/ecommerce-product-title-optimization]
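The revenue variable described in the first bullet is not shown in the chunk below; a minimal sketch of how it could be derived, using two toy rows taken from the head() output echoed earlier, is:

```r
# Toy rows standing in for dataset_init (values from the echoed head() above)
toy <- data.frame(price = c(849.97, 799.97), boughtInLastMonth = c(50, 400))

# Revenue = Price x BoughtInLastMonth, per the definition above
toy$revenue <- toy$price * toy$boughtInLastMonth
toy$revenue  # 42498.5 and 319988
```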
# Create a new column "discountAmount"
dataset_init$discountAmount <- ifelse(dataset_init$hasDiscount_dummy == 1, abs(dataset_init$price - dataset_init$listPrice), 0)

dataset_init$discountPercentage <- round(dataset_init$discountAmount / dataset_init$listPrice * 100)

# Replace NaN values in discountPercentage with 0
dataset_init$discountPercentage[is.nan(dataset_init$discountPercentage)] <- 0

# Convert to integer
dataset_init$discountPercentage <- as.integer(dataset_init$discountPercentage)

# Create New Variable title length
dataset_init$titleLength <- nchar(dataset_init$title)

head(dataset_init)
##               asin
## 651901  B08J3Y9SYR
## 727943  B00CH9QWOU
## 481491  B07WXZDHGV
## 729325  B0183D35S4
## 1324647 B0B2K6PMPK
## 1804174 B0BSFW26YK
##                                                                                                                                                                                                          title
## 651901  GOTRAX EBE3 27.5inch Electric Bike with 48V 10Ah Removable Lithium-Ion Battery, 500W Powerful Motor up Speed 32km/h, Shimano Professional 21 Speed Gears,Dual Disc Brakes Alloy Frame Electric Bicycle
## 727943                                                                                                                            Breville Barista Express Espresso Machine, Brushed Stainless Steel, BES870XL
## 481491                                                    ChargePoint Home Flex Level 2 WiFi Enabled 240 Volt NEMA 6-50 Plug Electric Vehicle EV Charger for Plug in or Hardwired Indoor Outdoor Setup w/Cable
## 729325                                                                                                                         CUISINART MCP-12NCC MultiClad Pro Stainless Steel 12-Piece Cookware Set, Silver
## 1324647                               EverCross EV10K PRO App-Enabled Electric Scooter, Scooter Adults with 500W Motor, Up to 19 MPH & 22 Miles E-Scooter, Lightweight Folding for 10'' Honeycomb Tires, Black
## 1804174             ECOVACS Deebot N10+ Robot Vacuum and Mop Cleaner, Self Emptying Robotic Vacuum, 3800Pa Suction, Laser Based LiDAR Navigation, Carpet Detection, Multi Floor Mapping, Personalized Cleaning
##                                                                 imgUrl
## 651901  https://m.media-amazon.com/images/I/71eamY5Gp+L._AC_UL320_.jpg
## 727943  https://m.media-amazon.com/images/I/71BvCt6eAFL._AC_UY218_.jpg
## 481491  https://m.media-amazon.com/images/I/51kxX+QE+-L._AC_UL320_.jpg
## 729325  https://m.media-amazon.com/images/I/81RRGy2PHhL._AC_UY218_.jpg
## 1324647 https://m.media-amazon.com/images/I/61HkiF3RUDL._AC_UL320_.jpg
## 1804174 https://m.media-amazon.com/images/I/61SjFdho74L._AC_UY218_.jpg
##                                  productURL stars reviews  price listPrice
## 651901  https://www.amazon.ca/dp/B08J3Y9SYR   4.3     349 849.97    999.99
## 727943  https://www.amazon.ca/dp/B00CH9QWOU   4.5   22226 799.97    999.99
## 481491  https://www.amazon.ca/dp/B07WXZDHGV   4.3    3718 740.98    999.00
## 729325  https://www.amazon.ca/dp/B0183D35S4   4.4     189 397.03    899.99
## 1324647 https://www.amazon.ca/dp/B0B2K6PMPK   3.9     210 799.99    899.99
## 1804174 https://www.amazon.ca/dp/B0BSFW26YK   4.1    3184 666.98    899.99
##                         categoryName isBestSeller boughtInLastMonth
## 651901             Cycling Equipment            0                50
## 727943                Home & Kitchen            1               400
## 481491  Automotive Tools & Equipment            0               400
## 729325                Home & Kitchen            0               200
## 1324647                 Toys & Games            0                50
## 1804174         Vacuums & Floor Care            0                50
##         hasDiscount_dummy discountAmount discountPercentage titleLength
## 651901                  1         150.02                 15         198
## 727943                  1         200.02                 20          76
## 481491                  1         258.02                 26         148
## 729325                  1         502.96                 56          79
## 1324647                 1         100.00                 11         168
## 1804174                 1         233.01                 26         186

Lastly, as part of our non-linear (iterative) process, we use summary() and sd() to inspect the skewness of the newly created variables. We will then adjust them during the log transformation step later on.

summary(dataset_init)
##      asin              title              imgUrl           productURL       
##  Length:23000       Length:23000       Length:23000       Length:23000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      stars          reviews           price          listPrice      
##  Min.   :1.000   Min.   :     1   Min.   :  0.01   Min.   :   2.77  
##  1st Qu.:4.200   1st Qu.:   168   1st Qu.: 14.19   1st Qu.:  17.35  
##  Median :4.400   Median :   798   Median : 21.99   Median :  26.99  
##  Mean   :4.382   Mean   :  3941   Mean   : 35.63   Mean   :  43.25  
##  3rd Qu.:4.600   3rd Qu.:  3233   3rd Qu.: 36.97   3rd Qu.:  44.97  
##  Max.   :5.000   Max.   :453379   Max.   :849.97   Max.   : 999.99  
##  categoryName        isBestSeller     boughtInLastMonth hasDiscount_dummy
##  Length:23000       Min.   :0.00000   Min.   :   50.0   Min.   :1        
##  Class :character   1st Qu.:0.00000   1st Qu.:   50.0   1st Qu.:1        
##  Mode  :character   Median :0.00000   Median :  100.0   Median :1        
##                     Mean   :0.06087   Mean   :  232.6   Mean   :1        
##                     3rd Qu.:0.00000   3rd Qu.:  200.0   3rd Qu.:1        
##                     Max.   :1.00000   Max.   :20000.0   Max.   :1        
##  discountAmount    discountPercentage  titleLength   
##  Min.   :  0.280   Min.   :  1.00     Min.   :  3.0  
##  1st Qu.:  2.000   1st Qu.:  9.00     1st Qu.: 82.0  
##  Median :  4.000   Median : 15.00     Median :134.0  
##  Mean   :  7.620   Mean   : 17.38     Mean   :128.7  
##  3rd Qu.:  8.092   3rd Qu.: 23.00     3rd Qu.:177.0  
##  Max.   :502.960   Max.   :100.00     Max.   :382.0
print("The standard deviation calculation")
## [1] "The standard deviation calculation"
variables <- c("reviews", "price", "listPrice", "isBestSeller", "boughtInLastMonth", "hasDiscount_dummy", "discountAmount", "discountPercentage", "titleLength")
for (variable in variables) {
    # Print the variable name and its standard deviation
    cat("Variable:", variable, "= ", sd(dataset_init[[variable]]), "\n")
}
## Variable: reviews =  10802.74 
## Variable: price =  48.83101 
## Variable: listPrice =  59.43955 
## Variable: isBestSeller =  0.2390961 
## Variable: boughtInLastMonth =  537.7035 
## Variable: hasDiscount_dummy =  0 
## Variable: discountAmount =  13.53978 
## Variable: discountPercentage =  10.49428 
## Variable: titleLength =  53.47302
rm(variable, variables)
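A quick way to quantify the direction of this skewness (an illustrative check, not part of the original pipeline) is the nonparametric skew, (mean - median) / sd, where a clearly positive value indicates right skewness. For reviews, the summary above gives roughly (3941 - 798) / 10803, or about 0.29, confirming a strong right skew.

# Nonparametric skew of reviews: positive values indicate right skewness
(mean(dataset_init$reviews) - median(dataset_init$reviews)) / sd(dataset_init$reviews)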

2.4. Data Transformation

To address the right-skewness of the continuous variables, a log transformation will be applied. Log transformation is a common technique for handling right-skewed distributions: it reduces the magnitude of larger values and spreads out smaller values. By applying it, we aim to make the distribution of the continuous variables more symmetric, which can improve the performance of the statistical models used in the modelling phase later:

  1. Classification Tree: Log transformation can help balance the distribution of predictor variables, making splits more equitable and leading to better decision boundaries. This can potentially result in more accurate predictions by reducing bias towards dominant classes.

  2. Logistic Regression: Log transformation can mitigate the influence of outliers and improve the linearity assumption between predictor variables and the log odds of the response variable. This can enhance the interpretability and stability of the logistic regression coefficients, leading to more reliable estimates of the probability of class membership.
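To illustrate the compression effect (with made-up values, not taken from the dataset), log(1 + x) shrinks large values far more than small ones, which is exactly what right-skewed variables such as reviews need:

# Each tenfold increase in x adds only about 2.3 to log(1 + x)
log(1 + c(1, 10, 100, 1000, 10000))
## [1] 0.6931472 2.3978953 4.6151205 6.9087548 9.2104404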

Next, we focus on the numeric variables that showed skewness in the earlier histograms. Title length and the categorical/binary variables will be excluded from this process.

# Create faceted histograms
columns <- c( "price", "boughtInLastMonth", "reviews", "discountAmount", "discountPercentage")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))

  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After the log transformation, we visualize the dataset in histograms to confirm that the skewness of these variables has been successfully addressed.

# Apply log transformation and save to new variables
dataset_init$reviews_log <- log(1 + dataset_init$reviews)
dataset_init$price_log <- log(1 + dataset_init$price)
dataset_init$boughtInLastMonth_log <- log(1 + dataset_init$boughtInLastMonth)
dataset_init$discountAmount_log <- log(1 + dataset_init$discountAmount)
dataset_init$discountPercentage_log <- log(1 + dataset_init$discountPercentage)

# Create faceted histograms
columns <- c("reviews_log", "price_log", "boughtInLastMonth_log", "discountAmount_log","discountPercentage_log")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Binary categorical variables do not require normalization or standardization, so ‘hasDiscount_dummy’ and ‘isBestSeller’ are left unchanged.

Data normalization is useful when working with algorithms sensitive to feature scaling (such as k-nearest neighbours or neural networks). Normalizing will make these algorithms converge faster. Normalization is also useful when features have different units or scales or when you need values in a bounded interval such as [0, 1] or another specific range.

Based on the above reasons, the following variables will be normalized using the min-max scaling approach (using the log-transformed versions where available): discountAmount_log, discountPercentage_log, price_log, stars, reviews_log, boughtInLastMonth_log, and titleLength.

# Define a function to perform min-max scaling
min_max_scaling <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply min-max scaling to the specified variables and add them to the original dataframe
dataset_init <- dataset_init %>%
  mutate(discountAmount_log_normalized = min_max_scaling(discountAmount_log),
         discountPercentage_log_normalized = min_max_scaling(discountPercentage_log),
         price_log_normalized = min_max_scaling(price_log),
         stars_normalized = min_max_scaling(stars),
         reviews_log_normalized = min_max_scaling(reviews_log),
         boughtInLastMonth_log_normalized = min_max_scaling(boughtInLastMonth_log),
         titleLength_normalized = min_max_scaling(titleLength))
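As a quick sanity check on min_max_scaling() (illustrative input values, not taken from the dataset), the minimum should map to 0, the maximum to 1, and everything else to a value in between:

# Min-max scaling maps the range [2, 11] onto [0, 1]
min_max_scaling(c(2, 5, 8, 11))
## [1] 0.0000000 0.3333333 0.6666667 1.0000000

Note that the function would return NaN for a constant column (where max(x) equals min(x)); this is not an issue here because all selected variables vary.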

# Visualize the normalized variables to confirm they're on the same scale.

columns <- c("discountAmount_log_normalized", "discountPercentage_log_normalized", "price_log_normalized", "stars_normalized", "reviews_log_normalized", "boughtInLastMonth_log_normalized", "titleLength_normalized")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

At this point, we have created quite a few new variables. From this expanded set of columns, we will choose the variables to be used in our classification tree and logistic regression models.

2.5. Initial Plotting

As a reminder, the goal of this research is to answer “which departments should sellers focus on” and “how to become a best seller”. To make this concrete, let us examine the number of best sellers in each category.

dataset_init <- merge(dataset_init, categories, by.x = "categoryName", by.y = "categoryName", all.x=TRUE)

# Calculate the number of bestsellers in each category
bestseller_counts <- dataset_init %>%
  filter(isBestSeller == 1) %>%
  group_by(department) %>%
  summarise(num_bestsellers = n())

# Calculate the number of products in each category
product_counts <- dataset_init %>%
  group_by(department) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) 

# Reorder departments based on the number of bestsellers
bestseller_counts <- bestseller_counts %>%
  arrange(desc(num_bestsellers))

# Create the plot for number of bestsellers and number of products together
plot1 <- ggplot(bestseller_counts, aes(x = num_bestsellers, y = reorder(department, num_bestsellers))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Number of Bestsellers", y = "Department", title = "# Bestsellers per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

plot2 <- ggplot(product_counts, aes(x = count, y = reorder(department, count))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Number of Products", y = "", title = "# Products per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

# Arrange plots 1 and 2 together
grid.arrange(plot1, plot2, nrow = 1)

# Now, create the plot for the highest percentage of bestsellers
# Calculate the number of bestsellers and total items in each department
department_counts <- dataset_init %>%
  group_by(department) %>%
  summarise(total_items = n(),
  best_sellers = sum(isBestSeller == 1))

# Calculate the percentage of bestsellers in each department
department_counts <- department_counts %>%
  mutate(percentage_best_sellers = (best_sellers / total_items) * 100)

# Rank departments based on the highest percentage of bestsellers
department_counts <- department_counts %>%
  arrange(desc(percentage_best_sellers))

# Create the plot for the highest percentage of bestsellers
plot3 <- ggplot(department_counts, aes(x = percentage_best_sellers, y = reorder(department, percentage_best_sellers))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Percentage of Bestsellers", y = "", title = "Highest Percentage of Bestsellers per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

# Display plot3 separately
plot3

# Clean up the workspace to keep it tidy and focus only on the important variables
rm(min_max_scaling,generate_pie_chart,indices_with_whitespace,plot,plot1,plot2,plot3,plots)

2.6. Reflection

Through the critical phases of data exploration and preparation, we conclude that exploring and understanding each variable is an important step, as it lets us identify the challenges and choose the strategies to overcome them. Moreover, generating a clean and useful dataset is essential to producing reliable input for the subsequent modelling activities. Activities such as removing 0 or NA values, deleting unimportant categories, and renaming data categories are necessary to generate the final dataset. The final data dictionary resulting from the data cleaning and transformation activities is shown below.

Field name Data type Description
categoryName Character Name of the product category
asin Character Product ID from Amazon.ca
title Character Title of the product
imgUrl Character URL of the product image
productURL Character URL of the product
stars Dbl Rating of the product. If no rating is available, it is represented as 0
reviews Integer Number of reviews for the product. If no reviews are available, it is represented as 0
price Dbl Current price of the product. If the price is unavailable, it is represented as 0
listPrice Dbl Original price of the product before any discounts. If no list price is available, it is represented as 0
isBestSeller Dbl Dummy (0/1) indicating whether the product is labeled as a best seller
boughtInLastMonth Integer Amount of product that was bought in the last month
department Character Department under which the product is categorized
hasDiscount_dummy Dbl Value 0-1 indicating whether the product has a discount or not
discountAmount Dbl Amount of discount in $
discountPercentage Integer Percentage of the discount applied to the product
titleLength Integer Length of the product title in characters
reviews_log Dbl Logarithm of the number of reviews
price_log Dbl Logarithm of the current price
boughtInLastMonth_log Dbl Logarithm of the quantity sold in the previous month
discountAmount_log Dbl Logarithm of the discount amount
discountPercentage_log Dbl Logarithm of the discount percentage
discountAmount_log_normalized Dbl Normalized value of the logarithm of the discount amount
discountPercentage_log_normalized Dbl Normalized value of the logarithm of the discount percentage
price_log_normalized Dbl Normalized value of the logarithm of the price
stars_normalized Dbl Normalized value of the stars score
reviews_log_normalized Dbl Normalized value of the logarithm of the number of reviews
boughtInLastMonth_log_normalized Dbl Normalized value of the logarithm of the amount of product bought in the last month
titleLength_normalized Dbl Normalized value of the title length

Part 3 - Data Modeling

Finally, since the dataset is ready to be modelled, we can assign it to a more intuitive name. In this part, we will use two modelling techniques to reach the goal of this analytics:

  1. Classification using Decision Tree Methods
  2. Logistic Regression
df_model <- dataset_init

3.1. Modeling preparation

First, we need to perform a correlation analysis.

As a further data exploration step, we check the correlations between the variables of interest. This step helps detect highly correlated variables upfront, which eases interpreting the classification and regression models built later. This is done using the apa.cor.table() function from the apaTables package.

columns <- c("stars", "reviews", "price", "listPrice", "boughtInLastMonth", "discountAmount", "discountPercentage", "titleLength")

apaTables::apa.cor.table(df_model[ columns ])
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable              M       SD       1            2            3           
##   1. stars              4.38    0.33                                           
##                                                                                
##   2. reviews            3940.67 10802.74 .05**                                 
##                                          [.04, .06]                            
##                                                                                
##   3. price              35.63   48.83    .02**        -.00                     
##                                          [.01, .03]   [-.01, .01]              
##                                                                                
##   4. listPrice          43.25   59.44    .02**        .00          .99**       
##                                          [.00, .03]   [-.01, .02]  [.99, .99]  
##                                                                                
##   5. boughtInLastMonth  232.64  537.70   .09**        .22**        -.04**      
##                                          [.08, .10]   [.21, .24]   [-.06, -.03]
##                                                                                
##   6. discountAmount     7.62    13.54    -.00         .02*         .73**       
##                                          [-.01, .01]  [.00, .03]   [.72, .74]  
##                                                                                
##   7. discountPercentage 17.38   10.49    -.03**       .05**        -.07**      
##                                          [-.05, -.02] [.04, .06]   [-.09, -.06]
##                                                                                
##   8. titleLength        128.73  53.47    -.12**       -.04**       .05**       
##                                          [-.13, -.10] [-.06, -.03] [.04, .07]  
##                                                                                
##   4            5            6          7           
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##   -.04**                                           
##   [-.05, -.03]                                     
##                                                    
##   .83**        -.01*                               
##   [.82, .83]   [-.03, -.00]                        
##                                                    
##   .02**        .06**        .35**                  
##   [.00, .03]   [.05, .07]   [.33, .36]             
##                                                    
##   .05**        .00          .02**      -.05**      
##   [.04, .06]   [-.01, .02]  [.01, .03] [-.07, -.04]
##                                                    
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 

To make this more visual, we will use the corrplot() function instead. Please note that an “X” mark means the correlation is not statistically significant. The details of these correlations will be explained in the modeling part. In this section, we can already see that almost all of the variables are linearly correlated, although most of the correlations are weak. In addition, discountAmount and the price-related variables are, as expected, highly correlated, because the discount is derived from the price.

# Subset the DataFrame with the specified columns
subset_df <- df_model[, columns]

# Calculate correlation matrix
correlation_matrix <- cor(subset_df)

# Create the correlation matrix with p-values
res <- cor.mtest(subset_df)
corrplot(correlation_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black",
         tl.srt = 45, tl.cex = 0.7, order = "hclust", 
         diag = FALSE, outline = TRUE, 
         title = "Correlation Heatmap", p.mat = res$p)

3.2. Classification Method

Second, we will build our first model using the classification method.

Reasons for choosing this method:

  1. The data is already labelled as best seller or not. Hence, supervised machine learning is the appropriate approach, since supervised machine learning is the category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns [7].

  2. The business problem aims to understand how Amazon chooses its best-selling products, so that product developers can devise the best strategy for releasing products that will make a strong impact on the market. The Decision Tree classification method is appropriate because it predicts a target class (in this case: best selling or not). In addition, Decision Tree classification not only predicts whether a product will be a best seller but also provides a clear decision-making process that stakeholders can interpret. In other words, we can see which variables carry the most weight in determining the target class.

  3. Looking at the correlation matrix above, almost all of the variables have low correlations. Thus, it is likely that some variables do not directly contribute to a product’s success on Amazon. Decision Tree classification can handle such noise effectively: because a Decision Tree robustly considers all available features, including potential outliers [8], it can discern the patterns that distinguish best-selling products from the others, providing accurate predictions despite the presence of noisy data.

Steps in doing the Decision Tree Classification:

Step 1: Modeling Preparation

Since we have performed two transformations (log transformation and normalization), we will have three models:

  • Base df (without any variable transformation). This is a benefit of the classification method: the scale of the variables commonly does not distort the model result. In contrast, in coefficient-based methods the scale of the variables affects the weights (gradients/coefficients) and can thus introduce bias.

  • Log df (with log-transformed variables)

  • Norm df (with normalized variables)

Output: isBestSeller: yes or no (positive outcome = yes).

Input: all hypothetically important variables will be included as predictors: stars, reviews, price, boughtInLastMonth, department, discountPercentage, and titleLength.

To make sure that our modeling will not affect the original dataframe, we will assign it to a new dataframe named df_cl, short for “dataframe for classification modeling”.

df_cl <- df_model

Next, we will check the type of each variable in df_cl:

str(df_cl)
## 'data.frame':    23000 obs. of  28 variables:
##  $ categoryName                     : chr  "3D Printing & Scanning" "3D Printing & Scanning" "3D Printing & Scanning" "3D Printing & Scanning" ...
##  $ asin                             : chr  "B07TRPPGT7" "B0C46564WT" "B07PGY2JP1" "B07PGYHYV8" ...
##  $ title                            : chr  "DURAMIC 3D PETG Printer Filament 1.75mm Black, 3D Printing Filament 1kg Spool(2.2lbs), Non-Tangling Non-Cloggin"| __truncated__ "Black PLA+ 3D Printer Filament,1.75mm Toughness Enhanced PLA Pus 3D Print Black Filament 1kg Spool (2.2lbs) Hig"| __truncated__ "Overture PLA Filament 1.75mm 3D Printer Filament, 1kg Spool (2.2lbs), Dimensional Accuracy +/- 0.03 mm, Fit Mos"| __truncated__ "OVERTURE PETG Filament 1.75mm, 3D Printer Filament, 1kg Filament (2.2lbs), Dimensional Accuracy 99% Probability"| __truncated__ ...
##  $ imgUrl                           : chr  "https://m.media-amazon.com/images/I/71CG+9lz63L._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/81fnJdJWH8L._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/81SrXc6oG3L._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/81VJZgogW9L._AC_UL320_.jpg" ...
##  $ productURL                       : chr  "https://www.amazon.ca/dp/B07TRPPGT7" "https://www.amazon.ca/dp/B0C46564WT" "https://www.amazon.ca/dp/B07PGY2JP1" "https://www.amazon.ca/dp/B07PGYHYV8" ...
##  $ stars                            : num  4.3 4.4 4.3 4.3 4.4 4.2 4.2 4.6 3.9 4.5 ...
##  $ reviews                          : int  2237 86 19621 11236 1394 39 7084 130 1101 265 ...
##  $ price                            : num  27 16.1 27 27 32 ...
##  $ listPrice                        : num  36 23 31 31 38.4 ...
##  $ isBestSeller                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ boughtInLastMonth                : int  50 100 300 400 100 50 100 50 100 50 ...
##  $ hasDiscount_dummy                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ discountAmount                   : num  9 6.92 4 4 6.39 8 1 2 5 106 ...
##  $ discountPercentage               : int  25 30 13 13 17 40 5 5 11 14 ...
##  $ titleLength                      : int  159 193 132 162 179 164 161 118 207 150 ...
##  $ reviews_log                      : num  7.71 4.47 9.88 9.33 7.24 ...
##  $ price_log                        : num  3.33 2.84 3.33 3.33 3.5 ...
##  $ boughtInLastMonth_log            : num  3.93 4.62 5.71 5.99 4.62 ...
##  $ discountAmount_log               : num  2.3 2.07 1.61 1.61 2 ...
##  $ discountPercentage_log           : num  3.26 3.43 2.64 2.64 2.89 ...
##  $ discountAmount_log_normalized    : num  0.344 0.305 0.228 0.228 0.293 ...
##  $ discountPercentage_log_normalized: num  0.654 0.699 0.496 0.496 0.56 ...
##  $ price_log_normalized             : num  0.493 0.42 0.493 0.493 0.518 ...
##  $ stars_normalized                 : num  0.825 0.85 0.825 0.825 0.85 0.8 0.8 0.9 0.725 0.875 ...
##  $ reviews_log_normalized           : num  0.569 0.306 0.745 0.7 0.531 ...
##  $ boughtInLastMonth_log_normalized : num  0 0.114 0.297 0.345 0.114 ...
##  $ titleLength_normalized           : num  0.412 0.501 0.34 0.42 0.464 ...
##  $ department                       : chr  "Industrial & Scientific" "Industrial & Scientific" "Industrial & Scientific" "Industrial & Scientific" ...

We can see that most of the predictors are numeric, so we can leave them as they are without converting them to factors. However, the categorical variables, department and isBestSeller, need to be converted to factors.

#First, since isBestSeller is numeric boolean 0 and 1, we will change it to "no" and "yes" respectively.
df_cl$isBestSeller <- as.character(df_cl$isBestSeller)
df_cl$isBestSeller <- ifelse(df_cl$isBestSeller == 1, "yes", "no")
df_cl$department <- gsub("&| |,", "", df_cl$department)
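For instance (an illustrative call, not part of the pipeline), the gsub() pattern strips ampersands, spaces, and commas, turning department names into the compact labels that later appear in the decision tree output:

gsub("&| |,", "", "Arts, Crafts & Sewing")
## [1] "ArtsCraftsSewing"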

Then, do the Factor transformation on both of the variables as we have explained.

vrs <- c("department", "isBestSeller")
df_cl[ vrs ] <- lapply(df_cl[ vrs ], factor)
remove(vrs)
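As a quick check of the conversion, the factor levels of the outcome variable are sorted alphabetically, so “no” becomes the reference level:

levels(df_cl$isBestSeller)
## [1] "no"  "yes"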

Finally, we will prepare each form of the dataframe (Base, Log, and Norm).

Why this is important: we have already applied two kinds of transformation, LOG and NORM. Theoretically, this will not affect the output of the classification, because the transformations reduce skewness and classification trees are already quite robust against outliers and skewness (Cieslak (2012), [9]).

As part of our non-linear (iterative) process, we ran the classification on all three versions (untransformed, log-transformed, and normalized variables), and they yielded the same results.

We will define each of the dataset that will be used as the main datasource in the modeling.

column_for_base <- c("isBestSeller", "department", "stars", "reviews", "price", "boughtInLastMonth", "discountPercentage", "titleLength")
column_for_log <- c("isBestSeller", "department", "stars", "reviews_log", "price_log", "boughtInLastMonth_log", "discountPercentage_log", "titleLength")
column_for_norm <- c("isBestSeller", "department", "stars_normalized", "reviews_log_normalized", "price_log_normalized","boughtInLastMonth_log", "discountPercentage_log_normalized", "titleLength_normalized")

# Create the subset of the dataframe
df_cl_base <- df_cl[column_for_base]
df_cl_log <- df_cl[column_for_log]
df_cl_norm <- df_cl[column_for_norm]

Important note on naming:

For base = df_cl_base (without transformation)

For log = df_cl_log

For norm = df_cl_norm

Step 2: Training and Testing

create_train_test_split <- function(data, prop) {
  set.seed(46748717)
  
  # Split the data using the supplied proportion
  split <- rsample::initial_split(data, prop = prop)
  training_set <- training(split)
  testing_set <- testing(split)
  
  # Output: a list with the training and testing sets
  list(training = training_set, testing = testing_set)
}
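Because set.seed() is called inside the function, repeated calls on the same data reproduce exactly the same split (a quick reproducibility check, not part of the original pipeline):

s1 <- create_train_test_split(df_cl_base, 0.7)
s2 <- create_train_test_split(df_cl_base, 0.7)
identical(s1$training, s2$training)
## [1] TRUE
rm(s1, s2)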

First, we will use df_cl_base as the main source of the modelling.

main_source <- df_cl_base
train_test_df <- create_train_test_split(main_source, 0.7)
training <- train_test_df$training
testing <- train_test_df$testing

Step 3: Decision Tree Analysis

model <- C50::C5.0(isBestSeller ~., 
                   data = training)
summary(model)
## 
## Call:
## C5.0.formula(formula = isBestSeller ~ ., data = training)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Apr 19 17:21:24 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 16099 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## boughtInLastMonth <= 300: no (13669/537)
## boughtInLastMonth > 300:
## :...reviews <= 2359: no (1274/129)
##     reviews > 2359:
##     :...boughtInLastMonth > 1000:
##         :...boughtInLastMonth > 7000: yes (15/4)
##         :   boughtInLastMonth <= 7000:
##         :   :...department in {ArtsCraftsSewing,Automotive,BeautyPersonalCare,
##         :       :              Fashion,IndustrialScientific,
##         :       :              ToysGames}: yes (12/2)
##         :       department in {BathBody,ClothingShoesJewelry,HouseholdSupplies,
##         :       :              PatioLawnGarden,SportsOutdoors}: no (6/2)
##         :       department = BabyProducts:
##         :       :...discountPercentage <= 19: no (8/2)
##         :       :   discountPercentage > 19: yes (5)
##         :       department = Electronics:
##         :       :...stars <= 4.7: yes (2)
##         :       :   stars > 4.7: no (2)
##         :       department = GroceryGourmetFood:
##         :       :...stars <= 4.5: yes (4/1)
##         :       :   stars > 4.5: no (8)
##         :       department = HealthHousehold:
##         :       :...boughtInLastMonth <= 5000: no (34/11)
##         :       :   boughtInLastMonth > 5000: yes (2)
##         :       department = HomeKitchen:
##         :       :...price <= 42.48: yes (32/13)
##         :       :   price > 42.48: no (5)
##         :       department = ToolsHomeImprovement:
##         :       :...price <= 69.9: yes (4)
##         :       :   price > 69.9: no (2)
##         :       department = PetSupplies:
##         :       :...discountPercentage > 19: no (4)
##         :       :   discountPercentage <= 19:
##         :       :   :...price <= 14.63: no (2)
##         :       :       price > 14.63: yes (8/1)
##         :       department = Beauty:
##         :       :...reviews > 56776: yes (5)
##         :           reviews <= 56776:
##         :           :...titleLength <= 189: no (28/2)
##         :               titleLength > 189:
##         :               :...reviews <= 3968: no (2)
##         :                   reviews > 3968: yes (12/1)
##         boughtInLastMonth <= 1000:
##         :...department in {BabyProducts,BathBody,Beauty,BeautyPersonalCare,
##             :              GroceryGourmetFood,HealthHousehold,
##             :              HouseholdSupplies,ToysGames}: no (423/51)
##             department in {ArtsCraftsSewing,Automotive,ClothingShoesJewelry,
##             :              Electronics,Fashion,HomeKitchen,
##             :              IndustrialScientific,PatioLawnGarden,PetSupplies,
##             :              SportsOutdoors,ToolsHomeImprovement}:
##             :...boughtInLastMonth <= 800:
##                 :...department in {ArtsCraftsSewing,Automotive,HomeKitchen,
##                 :   :              IndustrialScientific,PatioLawnGarden,
##                 :   :              PetSupplies}: no (263/61)
##                 :   department = ClothingShoesJewelry:
##                 :   :...price <= 23.49: no (5)
##                 :   :   price > 23.49: yes (5/1)
##                 :   department = Electronics:
##                 :   :...titleLength <= 88: yes (5/1)
##                 :   :   titleLength > 88: no (17/1)
##                 :   department = Fashion:
##                 :   :...price <= 21.52: no (8)
##                 :   :   price > 21.52: yes (5/1)
##                 :   department = ToolsHomeImprovement:
##                 :   :...boughtInLastMonth > 600: no (9)
##                 :   :   boughtInLastMonth <= 600:
##                 :   :   :...price <= 44.99: yes (14/5)
##                 :   :       price > 44.99: no (6)
##                 :   department = SportsOutdoors:
##                 :   :...reviews <= 3729: no (12)
##                 :       reviews > 3729:
##                 :       :...boughtInLastMonth > 500: yes (18/4)
##                 :           boughtInLastMonth <= 500:
##                 :           :...reviews <= 34418: no (16/2)
##                 :               reviews > 34418: yes (3)
##                 boughtInLastMonth > 800:
##                 :...department in {ArtsCraftsSewing,Electronics,
##                     :              IndustrialScientific,PatioLawnGarden,
##                     :              ToolsHomeImprovement}: no (37/11)
##                     department = Fashion: yes (7/3)
##                     department = Automotive:
##                     :...price <= 12.49: no (2)
##                     :   price > 12.49: yes (5)
##                     department = ClothingShoesJewelry:
##                     :...stars <= 4.2: yes (2)
##                     :   stars > 4.2: no (2)
##                     department = PetSupplies:
##                     :...discountPercentage <= 14: no (5)
##                     :   discountPercentage > 14:
##                     :   :...reviews <= 19158: yes (12/2)
##                     :       reviews > 19158: no (3)
##                     department = SportsOutdoors:
##                     :...stars > 4.4: yes (6)
##                     :   stars <= 4.4:
##                     :   :...reviews <= 17800: no (4)
##                     :       reviews > 17800: yes (3)
##                     department = HomeKitchen:
##                     :...price <= 11.98: yes (5)
##                         price > 11.98:
##                         :...boughtInLastMonth > 900: no (42/13)
##                             boughtInLastMonth <= 900:
##                             :...titleLength <= 73: no (3)
##                                 titleLength > 73: yes (7/1)
## 
## 
## Evaluation on training data (16099 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      55  862( 5.4%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##   15079    40    (a): class no
##     822   158    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% boughtInLastMonth
##   15.09% reviews
##    7.09% department
##    0.99% price
##    0.46% titleLength
##    0.29% discountPercentage
##    0.20% stars
## 
## 
## Time: 0.1 secs

We can clearly see that the model produces a large number of False Negatives (products that are Best Sellers but are classified as not).

We will try to improve the result by applying boosting, with the number of trials set to 10.

model_boost <- C5.0(isBestSeller ~.,
                    data = main_source,
                    trials = 10)

# Predicting on the testing data
print(paste("Confusion matrix of Model Boosting"))
## [1] "Confusion matrix of Model Boosting"
pred.test <- predict(model_boost, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  6901 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |      6470 |        11 |      6481 | 
## -------------|-----------|-----------|-----------|
##          yes |       361 |        59 |       420 | 
## -------------|-----------|-----------|-----------|
## Column Total |      6831 |        70 |      6901 | 
## -------------|-----------|-----------|-----------|
## 
## 

Notice that the False Negative rate is still high: 361 of the 420 actual best sellers (≈86%) are misclassified.

We also tried increasing the number of trials up to the maximum (100), but the number of False Negatives remained high: at trials = 100, the False Negative rate was 337/420 ≈ 80%. We omit that output here to save memory and running time, but the reader can change the trials argument above to 100 to verify this.
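
As a sanity check, the false-negative rates quoted in this section can be recomputed from the confusion-matrix counts with a minimal helper (the counts below are taken from the tables and figures reported above):

```r
# False Negative rate = FN / (FN + TP):
# the share of actual best sellers that the model misses.
false_negative_rate <- function(fn, tp) fn / (fn + tp)

# Counts from the boosted model (trials = 10): 361 missed, 59 caught
false_negative_rate(fn = 361, tp = 59)   # ~0.86

# Counts reported for trials = 100: 337 missed out of 420
false_negative_rate(fn = 337, tp = 83)   # ~0.80
```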

Next, we move on to Random Forest, because boosting has not delivered a big leap in performance. We start by setting the number of trees to 50.

input_no_tree = 50
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, # how many trees should be grown?
                                           mtry = 2,              # how many variables to sample at each split?
                                           replace = TRUE,        # sample cases with replacement?
                                           importance = TRUE)     # compute variable importance measures

Next, we check the confusion matrix on the testing set.

print(paste("Confusion matrix of Random Forest", "ntree=", input_no_tree))
## [1] "Confusion matrix of Random Forest ntree= 50"
pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  6901 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |      6481 |         0 |      6481 | 
## -------------|-----------|-----------|-----------|
##          yes |         6 |       414 |       420 | 
## -------------|-----------|-----------|-----------|
## Column Total |      6487 |       414 |      6901 | 
## -------------|-----------|-----------|-----------|
## 
## 

NOTICE that the performance improves dramatically, with accuracy (6481 + 414)/6901 ≈ 99.9%!
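
The accuracy figure can be verified directly from the CrossTable counts above (ntree = 50):

```r
# Confusion-matrix counts from the ntree = 50 random forest above
cm <- matrix(c(6481,   0,   # actual "no":  predicted no, yes
                  6, 414),  # actual "yes": predicted no, yes
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("no", "yes"),
                             Predicted = c("no", "yes")))

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
round(accuracy, 4)                    # 0.9991
```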

Next, increasing the number of trees to 100 gives accuracy that is quite similar to ntree = 50.

input_no_tree = 100
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, # how many trees should be grown?
                                           mtry = 2,              # how many variables to sample at each split?
                                           replace = TRUE,        # sample cases with replacement?
                                           importance = TRUE)     # compute variable importance measures

print(paste("Confusion matrix of Random Forest", "ntree=", input_no_tree))
## [1] "Confusion matrix of Random Forest ntree= 100"
pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  6901 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |      6481 |         0 |      6481 | 
## -------------|-----------|-----------|-----------|
##          yes |         1 |       419 |       420 | 
## -------------|-----------|-----------|-----------|
## Column Total |      6482 |       419 |      6901 | 
## -------------|-----------|-----------|-----------|
## 
## 

How can the model perform this well?

There are several arguments for why C5.0 and Random Forest result in such different accuracy:

  1. C50 relies on a simpler single-decision-tree method [10], while the randomForest library uses a more advanced ensemble of multiple trees [11]. Based on [12], Random Forest is more robust and achieves higher accuracy.
  2. It is also plausible that Amazon itself uses a Random-Forest-like algorithm when deciding its Best Sellers, in which case we have effectively cracked the Amazon Code. The information from Amazon [12] also aligns with the predictor variables used here to determine isBestSeller. Michael (2018, [13]) likewise applied Random Forest to this kind of prediction and achieved quite high accuracy (>80%).

Note, however, that model.forest above was fitted on main_source, which contains the rows later used as the testing set. Part of the near-perfect accuracy therefore reflects evaluating on data the model has already seen; fitting on the training split alone would give a more conservative estimate.

Step 4: Extract the Variable Importance

After determining the best model, the next thing to do is to extract, for each department, which variables are the most important.

We will focus on the five departments with the highest number of isBestSeller products: Sports & Outdoors, Automotive, Clothing, Shoes & Jewelry, Electronics, and Fashion. As proof, please check Part 2.6.

# Extract variable importance
importance <- importance(model.forest)
print(importance)
##                           no       yes MeanDecreaseAccuracy MeanDecreaseGini
## department         31.001670  8.739757            32.553018         295.5744
## stars              10.659643  2.623006            10.615952         230.4975
## reviews             1.936198 21.059314            10.879230         550.8376
## price              13.691987  5.005392            14.536045         441.0056
## boughtInLastMonth  27.342615 41.803666            39.021249         325.2085
## discountPercentage  7.687595  5.501632             8.676424         324.0392
## titleLength         8.759896  4.837539             9.531634         425.5198
# Plot
varImpPlot(model.forest)

There are two indexes used to evaluate the Random Forest here: MDA (Mean Decrease Accuracy) and MDG (Mean Decrease Gini).

  1. Mean Decrease in Gini / Impurity (MDG/MDI): measures how much each variable contributes to the homogeneity of the nodes and leaves in the trees. Variables that lead to the most significant decrease in impurity when used to split a node are considered more important. MDI works well when the classes are balanced; however, it can be biased toward features with more categories when classes are imbalanced.

  2. Mean Decrease in Accuracy (MDA): calculated by permuting the values of each predictor variable across the out-of-bag (OOB) samples for each tree and observing the effect on accuracy. It is more robust to class imbalance.
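
To make MDG more concrete, here is a small sketch of Gini impurity and the decrease that one split achieves (the class counts are illustrative, not taken from our data):

```r
# Gini impurity of a node given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Illustrative parent node: 90 "no", 10 "yes"
parent <- c(no = 90, yes = 10)
left   <- c(no = 85, yes = 1)   # left child after a candidate split
right  <- c(no = 5,  yes = 9)   # right child

n <- sum(parent)
weighted_child <- sum(left) / n * gini(left) + sum(right) / n * gini(right)
gini_decrease  <- gini(parent) - weighted_child
gini_decrease   # > 0: the split makes the children purer than the parent
```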

Since the classes (isBestSeller = yes or no) are not balanced, we will focus on evaluating the most important variables based on MDA. In this case, using the base dataframe, the most important variables per the MeanDecreaseAccuracy column above are: boughtInLastMonth, department, price, and reviews.

Reflection based on this:

  1. Department: since each department can have different characteristics, we will deep dive into only 5 departments (as explained above).

  2. Revenue: this variable would actually introduce bias, because revenue is simply price × boughtInLastMonth. It is also unclear which comes first: does Amazon label a product a Best Seller because of its high boughtInLastMonth, or does it have a high boughtInLastMonth because it is a Best Seller? Since we could not get clarity on this, we decided to drop the variable from the modeling.

  3. Price: indeed important.

Hence, in this modeling part, the variable revenue is excluded. We also exclude listPrice, because it is highly correlated with price (0.99) and would only add bias.
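
The exclusion rule can be sketched generically. The helper below (hypothetical, not part of the original pipeline) flags numeric column pairs whose absolute correlation exceeds a cutoff, the same kind of check that surfaced the 0.99 correlation between price and listPrice:

```r
# Flag numeric column pairs with |correlation| above a cutoff;
# one column of each flagged pair is a candidate for removal.
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df[sapply(df, is.numeric)], use = "pairwise.complete.obs")
  cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             correlation = cm[idx])
}

# Toy example: b is an exact multiple of a, so the pair (a, b) is flagged
toy <- data.frame(a = 1:10, b = 2 * (1:10), c = c(5, 1, 4, 2, 3, 9, 7, 8, 6, 10))
high_cor_pairs(toy)
```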

column_for_base <- c("isBestSeller", "department", "stars", "reviews", "price", "discountPercentage", "titleLength", "boughtInLastMonth")
df_cl_base <- df_cl[column_for_base]

# Only choose top 5 departments
departments <- c("SportsOutdoors", "Automotive", "ClothingShoesJewelry", "Electronics", "Fashion")

df_cl_tmp <- df_cl_base
input_no_tree = 50

# Loop over each department
for (department_name in departments) {
  # Filter the data for the current department
  main_source <- df_cl_tmp %>% filter(department == department_name)
  
  # Perform train-test split
  train_test_df <- create_train_test_split(main_source, 0.7)
  training <- train_test_df$training
  testing <- train_test_df$testing
  
  # Train the Random Forest model
  model.forest <- randomForest(isBestSeller ~., data = training, ntree = input_no_tree, mtry = 2, replace = TRUE, importance = TRUE)
  
  # Plot variable importance with dynamic title
  varImpPlot(model.forest, main = paste("Variable Importance for", department_name))
  
  print(paste("Confusion matrix of Random Forest for Dept = ", department_name))
  pred.test <- predict(model.forest, training)
  CrossTable(training$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
}

## [1] "Confusion matrix of Random Forest for Dept =  SportsOutdoors"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  1267 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |      1133 |         0 |      1133 | 
## -------------|-----------|-----------|-----------|
##          yes |         3 |       131 |       134 | 
## -------------|-----------|-----------|-----------|
## Column Total |      1136 |       131 |      1267 | 
## -------------|-----------|-----------|-----------|
## 
## 

## [1] "Confusion matrix of Random Forest for Dept =  Automotive"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  641 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |       568 |         0 |       568 | 
## -------------|-----------|-----------|-----------|
##          yes |         1 |        72 |        73 | 
## -------------|-----------|-----------|-----------|
## Column Total |       569 |        72 |       641 | 
## -------------|-----------|-----------|-----------|
## 
## 

## [1] "Confusion matrix of Random Forest for Dept =  ClothingShoesJewelry"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  452 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |       406 |         0 |       406 | 
## -------------|-----------|-----------|-----------|
##          yes |         2 |        44 |        46 | 
## -------------|-----------|-----------|-----------|
## Column Total |       408 |        44 |       452 | 
## -------------|-----------|-----------|-----------|
## 
## 

## [1] "Confusion matrix of Random Forest for Dept =  Electronics"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  515 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |       469 |         0 |       469 | 
## -------------|-----------|-----------|-----------|
##          yes |         0 |        46 |        46 | 
## -------------|-----------|-----------|-----------|
## Column Total |       469 |        46 |       515 | 
## -------------|-----------|-----------|-----------|
## 
## 

## [1] "Confusion matrix of Random Forest for Dept =  Fashion"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  621 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |       575 |         0 |       575 | 
## -------------|-----------|-----------|-----------|
##          yes |         1 |        45 |        46 | 
## -------------|-----------|-----------|-----------|
## Column Total |       576 |        45 |       621 | 
## -------------|-----------|-----------|-----------|
## 
## 

From each of the graphs above, we can clearly see which variables carry more importance/weight in the Random Forest model. For example, Fashion has the following ranking of most important variables:

  1. Price

  2. boughtInLastMonth

  3. reviews

  4. discountPercentage

Which ones should we choose?

We decide visually, based on the grouping of and gaps between the MDA values. For example, in the fifth graph (Fashion), the MDA values of price and boughtInLastMonth are well separated from reviews and the rest. Thus, we will focus only on the price and boughtInLastMonth variables.
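
This gap-based selection can be sketched numerically with the all-department MDA values printed in Step 4 (the per-department plots are read the same way):

```r
# MeanDecreaseAccuracy values from importance(model.forest) above
mda <- c(boughtInLastMonth  = 39.02,
         department         = 32.55,
         price              = 14.54,
         reviews            = 10.88,
         stars              = 10.62,
         titleLength        = 9.53,
         discountPercentage = 8.68)

mda  <- sort(mda, decreasing = TRUE)
gaps <- -diff(mda)   # drop in MDA between consecutive ranks

# Keep only the variables above the single largest gap
keep <- names(mda)[seq_len(which.max(gaps))]
keep   # boughtInLastMonth and department
```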

A further summary will be given in Part 4. Next, let us continue with the second modeling method: LOGISTIC REGRESSION.

3.3. Logistic Regression

Reasons in Choosing Logistic Regression

Logistic regression is chosen because it is a powerful tool that allows multiple explanatory variables to be analyzed simultaneously while reducing the effect of confounding factors (Sperandei, 2014 [14]). Moreover, logistic regression is particularly useful for binary classification tasks, where the goal is to predict one of two possible classes (in this case, best seller or not best seller).
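
Under the hood, logistic regression maps a linear predictor onto a probability through the logistic (sigmoid) function, which base R exposes as plogis():

```r
# plogis(eta) = 1 / (1 + exp(-eta)) turns a log-odds value eta
# (the linear predictor b0 + b1*x1 + ...) into a probability in (0, 1)
eta <- c(-2, 0, 2)
plogis(eta)   # ~0.119, 0.500, 0.881

# Equivalently, exp(coefficient) gives the multiplicative change in
# the odds of the outcome per one-unit increase in a predictor.
```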

Steps in Logistic Regression

To stay on track through this long modeling process, we open the black box of the regression a little. In this research, the logistic regression analysis proceeds as follows:

  1. Preparing: subsetting the data and splitting it into training and testing sets with a 70:30 ratio.

  2. Executing the Regression Model: two main dataframes are used in this modeling part, df_mod (containing all departments) and df_modeling (containing the top 5 departments). This lets the analysis start from a helicopter view and then zoom in on the 5 departments (to answer the main business problems).

  3. Extracting Coefficients: a regression coefficient represents the weight/importance of a variable in each of the departments.

  4. Plotting the Coefficient.

  5. Analyzing the z-value: needed to assess the effect of the explanatory variables on the outcome. If the z-value is statistically significant (typically |z| > 1.96 for a 95% confidence interval), the predictor significantly changes the odds of the outcome occurring, controlling for other factors.

  6. Plotting the Partial Effect: to deepen the analysis, coefficients and z-values alone are not enough to explain the impact of each variable on being a Best Seller. The partial effect answers: what is the effect of one variable when all other variables are held constant?

  7. Model Evaluation: using AUC and the ROC curve, and finally the confusion matrix.

  8. Further Improvement: after examining the confusion matrix, we found inaccuracies, especially Type II errors (isBestSeller = “yes” predicted as “no”). We can still improve the model by choosing an optimal threshold for the logistic regression output. The idea is to use the ROC curve to find the probability cutoff that maximizes recall and reduces Type II errors; we cannot simply assume for every model that a predicted probability > 0.5 means isBestSeller = “yes”.

  9. Recalculate after the improvement.
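
Step 8 above can be sketched in base R. The sketch below picks the cutoff that maximizes Youden's J (TPR − FPR); prob and actual are hypothetical vectors standing in for a model's predicted probabilities and the true labels:

```r
# Pick the probability cutoff that maximizes Youden's J = TPR - FPR,
# instead of defaulting to 0.5.
optimal_threshold <- function(prob, actual, thresholds = seq(0.01, 0.99, 0.01)) {
  j <- sapply(thresholds, function(t) {
    pred <- prob >= t
    tpr  <- sum(pred & actual)  / sum(actual)    # recall on "yes"
    fpr  <- sum(pred & !actual) / sum(!actual)
    tpr - fpr
  })
  thresholds[which.max(j)]
}

# Hypothetical scores: best sellers score higher, but mostly below 0.5
set.seed(1)
actual <- c(rep(TRUE, 50), rep(FALSE, 450))
prob   <- c(runif(50, 0.10, 0.60), runif(450, 0.00, 0.25))
optimal_threshold(prob, actual)   # well below the default 0.5
```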

Step 1: Preparing Dataset for Logistic Regression

We prepare two dataframes for the logistic regression (i.e. df_mod and df_modeling). df_modeling is a subset of df_model that only contains data for the top 5 departments, ranked by the percentage of best sellers among the total products in each department.

df_modeling_all <- df_model
df_mod <- df_model
df_modeling <- subset(df_model, department %in% c("Sports & Outdoors", "Clothing, Shoes & Jewelry", "Automotive", "Electronics", "Fashion"))
department_dfs <- split(df_modeling, df_modeling$department)

Preparing Dataset for Logistic Regression Model Training and Testing. We split the data with a ratio of 70% for the training dataset and 30% for the testing dataset. We set a seed for this split so it can be reproduced.

#Preparing dataset for training and testing
set.seed(123)
split_ratio <- 0.7

#Dataset training and testing for all department dataset
indices <- sample(1:nrow(df_modeling_all), size = floor(split_ratio * nrow(df_modeling_all)))
df_train_all <- df_modeling_all[indices, ]
df_test_all <- df_modeling_all[-indices, ]

#Dataset training and testing for top 5 department
train_dfs <- list()
test_dfs <- list()

# Looping through each department
for (department in names(department_dfs)) {
  df <- department_dfs[[department]]
  indices <- sample(1:nrow(df), size = floor(split_ratio * nrow(df)))
  
  # Subsetting the data frame into training and testing sets
  train_df <- df[indices, ]
  test_df <- df[-indices, ]
  
  # Finally, storing the dataset above
  train_dfs[[department]] <- train_df
  test_dfs[[department]] <- test_df
}

Step 2: Executing the Logistic Regression

We perform the logistic regression modelling for df_mod (containing all departments) and df_modeling (containing the top 5 departments). The dependent variable is isBestSeller, and there are six independent variables: discountPercentage_log_normalized, price_log_normalized, reviews_log_normalized, titleLength_normalized, stars_normalized, and boughtInLastMonth. We then store all models and estimated-coefficient plots in lists.

# First, define an empty list to store model summaries for each department
model_summaries <- list()
models <- list()
coef_plots <- list()

# Then, let us do the modeling part
model <- glm(isBestSeller ~ discountPercentage_log_normalized + price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_train_all, family = binomial)

# Storing the summary 
model_summaries[["all"]] <- summary(model)
models[["all"]] <- model

# Plotting coefficients for the all-department model
coef_plots[["all"]] <- coefplot::coefplot(
  model,
  title = paste("Coefficients Plot for all department"),
  ylab = "Variables",
  xlab = "Estimated Coefficient"
)

# Looping through each department
for (department in names(train_dfs)) {
  # Getting the training data frame for the current department
  df_training <- train_dfs[[department]]
  
  # Fitting the logistic regression model
  model <- glm(isBestSeller ~ discountPercentage_log_normalized + price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_training, family = binomial)
  
  # Storing to df
  model_summaries[[department]] <- summary(model)
  models[[department]] <- model
  
  # Plotting coefficients
  coef_plots[[department]] <- coefplot::coefplot(
    model,
    title = paste("Coefficients Plot for", department),
    ylab = "Variables",
    xlab = "Estimated Coefficient"
  )
}

print(model_summaries)
## $all
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_train_all)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -6.362e+00  5.099e-01 -12.478  < 2e-16 ***
## discountPercentage_log_normalized  1.452e+00  2.561e-01   5.668 1.44e-08 ***
## price_log_normalized               1.080e+00  3.005e-01   3.593 0.000327 ***
## reviews_log_normalized             3.629e+00  2.381e-01  15.241  < 2e-16 ***
## titleLength_normalized             2.618e-01  2.477e-01   1.057 0.290518    
## stars_normalized                   1.223e-01  5.155e-01   0.237 0.812527    
## boughtInLastMonth                  6.312e-04  4.584e-05  13.769  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7412.4  on 16098  degrees of freedom
## Residual deviance: 6682.9  on 16092  degrees of freedom
## AIC: 6696.9
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $Automotive
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_training)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -6.849721   2.027905  -3.378 0.000731 ***
## discountPercentage_log_normalized  1.340927   0.991688   1.352 0.176322    
## price_log_normalized               1.763517   1.137965   1.550 0.121211    
## reviews_log_normalized             0.168493   0.999067   0.169 0.866071    
## titleLength_normalized             1.160666   1.086226   1.069 0.285281    
## stars_normalized                   2.373937   2.099565   1.131 0.258190    
## boughtInLastMonth                  0.004601   0.000751   6.127 8.93e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 466.72  on 640  degrees of freedom
## Residual deviance: 397.29  on 634  degrees of freedom
## AIC: 411.29
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## $`Clothing, Shoes & Jewelry`
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_training)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                       -6.9621715  2.5273931  -2.755  0.00587 **
## discountPercentage_log_normalized  3.1817965  1.2105167   2.628  0.00858 **
## price_log_normalized              -1.4228062  2.2587457  -0.630  0.52875   
## reviews_log_normalized             2.6031993  1.1282923   2.307  0.02104 * 
## titleLength_normalized             0.2949509  1.7764749   0.166  0.86813   
## stars_normalized                   2.3200321  2.4559017   0.945  0.34482   
## boughtInLastMonth                  0.0006578  0.0004741   1.387  0.16529   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.09  on 451  degrees of freedom
## Residual deviance: 261.11  on 445  degrees of freedom
## AIC: 275.11
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $Electronics
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_training)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        0.0761190  2.7323220   0.028  0.97777   
## discountPercentage_log_normalized -1.1257355  1.2088301  -0.931  0.35172   
## price_log_normalized              -2.5088483  1.3223378  -1.897  0.05779 . 
## reviews_log_normalized             3.4306855  1.1580737   2.962  0.00305 **
## titleLength_normalized            -1.4025858  1.2698695  -1.105  0.26937   
## stars_normalized                  -2.3369980  2.6540408  -0.881  0.37857   
## boughtInLastMonth                  0.0003702  0.0002570   1.441  0.14972   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 305.32  on 514  degrees of freedom
## Residual deviance: 287.16  on 508  degrees of freedom
## AIC: 301.16
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## $Fashion
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_training)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.095e+01  2.917e+00  -3.753 0.000175 ***
## discountPercentage_log_normalized -7.812e-02  1.181e+00  -0.066 0.947241    
## price_log_normalized               7.226e+00  1.917e+00   3.770 0.000163 ***
## reviews_log_normalized             1.931e+00  1.144e+00   1.688 0.091355 .  
## titleLength_normalized             1.097e+00  1.583e+00   0.693 0.488460    
## stars_normalized                   4.010e+00  2.838e+00   1.413 0.157669    
## boughtInLastMonth                  1.434e-03  3.905e-04   3.674 0.000239 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 317.75  on 620  degrees of freedom
## Residual deviance: 275.20  on 614  degrees of freedom
## AIC: 289.2
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $`Sports & Outdoors`
## 
## Call:
## glm(formula = isBestSeller ~ discountPercentage_log_normalized + 
##     price_log_normalized + reviews_log_normalized + titleLength_normalized + 
##     stars_normalized + boughtInLastMonth, family = binomial, 
##     data = df_training)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -4.1562730  1.2664749  -3.282  0.00103 ** 
## discountPercentage_log_normalized  0.8363204  0.6638110   1.260  0.20771    
## price_log_normalized               1.9417482  0.8790425   2.209  0.02718 *  
## reviews_log_normalized             0.6212382  0.6199952   1.002  0.31634    
## titleLength_normalized            -0.4996661  0.7115235  -0.702  0.48252    
## stars_normalized                  -0.0542106  1.3133335  -0.041  0.96707    
## boughtInLastMonth                  0.0034993  0.0004607   7.596 3.06e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 901.36  on 1266  degrees of freedom
## Residual deviance: 808.38  on 1260  degrees of freedom
## AIC: 822.38
## 
## Number of Fisher Scoring iterations: 5

Interpretation of the logistic regression models

Logistic Regression Model 1 (All Departments): The coefficients show each predictor's impact on the log odds of a product being a best seller. Discount percentage, price, reviews, and units bought in the last month are all statistically significant, with positive coefficients indicating a positive association with best-seller status; title length and star rating are not significant. The null deviance represents the model's fit with just the intercept, while the lower residual deviance shows the improvement in fit with the added predictors. Overall, this model suggests that larger discounts, higher prices, more reviews, and recent purchases strongly increase the odds of a product being a best seller.

Logistic Regression Model 2 (Automotive Department): The coefficients for discount percentage, price, reviews, title length, and star rating are not statistically significant at conventional levels, indicating weak or uncertain associations with the event. In contrast, the coefficient for "boughtInLastMonth" is highly significant, suggesting a strong positive association with the probability of a product being a best seller. The null deviance, representing the model with just the intercept, decreases noticeably with the added predictors, indicating improved model fit. Overall, this model suggests that the number of recent purchases is the most influential predictor in this department, while discounts, prices, and reviews have less clear or weaker effects.

Logistic Regression Model 3 (Clothing, Shoes & Jewelry Department): Among the predictors, only "discountPercentage_log_normalized" and "reviews_log_normalized" have statistically significant coefficients, both positive, indicating a positive association with the probability of a product becoming a best seller. The null deviance decreases with the addition of the predictors, indicating an improvement in model fit. The coefficients for "price_log_normalized," "titleLength_normalized," "stars_normalized," and "boughtInLastMonth" are not statistically significant at conventional levels, suggesting weaker associations. Overall, this model suggests that discounts and more reviews have a notable impact on best-seller status in this department, while the other factors have less clear or weaker effects.

Logistic Regression Model 4 (Electronics Department): Among the predictors, only "reviews_log_normalized" has a statistically significant positive coefficient, suggesting a strong positive association with the event. "price_log_normalized" approaches significance with a p-value of about 0.06 and a negative estimate, indicating a possible weak negative association. The other predictors, including discount percentage, title length, stars, and units bought in the last month, are not statistically significant. The null deviance decreases with the addition of the predictors, indicating an improvement in model fit. Overall, this model suggests that a higher review count is the most influential predictor in this department, while other factors may have weaker or less clear effects.

Logistic Regression Model 5 (Fashion Department): Among the predictors, "price_log_normalized" and "boughtInLastMonth" have statistically significant positive coefficients, indicating a positive association with the event. "reviews_log_normalized" approaches significance with a p-value of about 0.09, suggesting a possible positive association. The other predictors, including discount percentage, title length, and stars, are not statistically significant. The null deviance decreases with the addition of the predictors, indicating an improvement in model fit. Overall, this model suggests that higher prices, recent purchases, and potentially more reviews are associated with being a best seller in this department, while other factors may have weaker or less clear effects.

Logistic Regression Model 6 (Sports & Outdoors Department): Both the price of the product and recent purchases exhibit significant positive associations with the event: higher prices and more units bought in the last month are linked to an increased likelihood of being a best seller. Conversely, discount percentage, reviews, title length, and stars show weak or non-significant relationships with the predicted outcome. Overall, this analysis highlights pricing strategy and recent customer purchasing behaviour as the key drivers in this department, providing actionable input for pricing and marketing decisions.

Step 3: Extracting Coefficients

The extracted coefficients represent the relative importance (weights) of the variables in each department's model.
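The shape of the table returned by summary(glm)$coefficients can be illustrated on a tiny synthetic model; the data, seed, and variable names below are invented for this sketch.

```r
# Minimal sketch (synthetic data): summary(glm)$coefficients returns a matrix
# whose "Estimate" and "Pr(>|z|)" columns are the ones extracted per department.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))   # true intercept -1, slope 2
fit <- glm(y ~ x, family = binomial)

coefs <- summary(fit)$coefficients
colnames(coefs)   # "Estimate" "Std. Error" "z value" "Pr(>|z|)"

# Same shape as the per-department extraction in the loop that follows:
temp_df <- data.frame(
  Term     = rownames(coefs),
  Estimate = coefs[, "Estimate"],
  Pr.value = coefs[, "Pr(>|z|)"],
  stringsAsFactors = FALSE
)
temp_df
```

The per-department loop below pulls out exactly these two columns and stacks them into one data frame.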

# Initializing an empty data frame to store coefficients from all models
all_coefs <- data.frame(
  Department = character(),
  Term = character(),
  Estimate = numeric(),
  Pr.value = numeric(),
  stringsAsFactors = FALSE  # Avoid factors to make data manipulation easier
)

# Looping through each department and extracting its coefficients
for (department in c("all", names(train_dfs))) {
  # Extracting the coefficients table
  coefs <- summary(models[[department]])$coefficients
  
  # Creating a temporary data frame to store coefficients for current model
  temp_df <- data.frame(
    Department = department,
    Term = rownames(coefs),
    Estimate = coefs[, "Estimate"],
    Pr.value = coefs[, "Pr(>|z|)"],
    stringsAsFactors = FALSE
  )
  
  # Binding the temporary data frame to the main data frame
  all_coefs <- rbind(all_coefs, temp_df)
}

# Printing the final coefficients table
print(all_coefs)
##                                                   Department
## (Intercept)                                              all
## discountPercentage_log_normalized                        all
## price_log_normalized                                     all
## reviews_log_normalized                                   all
## titleLength_normalized                                   all
## stars_normalized                                         all
## boughtInLastMonth                                        all
## (Intercept)1                                      Automotive
## discountPercentage_log_normalized1                Automotive
## price_log_normalized1                             Automotive
## reviews_log_normalized1                           Automotive
## titleLength_normalized1                           Automotive
## stars_normalized1                                 Automotive
## boughtInLastMonth1                                Automotive
## (Intercept)2                       Clothing, Shoes & Jewelry
## discountPercentage_log_normalized2 Clothing, Shoes & Jewelry
## price_log_normalized2              Clothing, Shoes & Jewelry
## reviews_log_normalized2            Clothing, Shoes & Jewelry
## titleLength_normalized2            Clothing, Shoes & Jewelry
## stars_normalized2                  Clothing, Shoes & Jewelry
## boughtInLastMonth2                 Clothing, Shoes & Jewelry
## (Intercept)3                                     Electronics
## discountPercentage_log_normalized3               Electronics
## price_log_normalized3                            Electronics
## reviews_log_normalized3                          Electronics
## titleLength_normalized3                          Electronics
## stars_normalized3                                Electronics
## boughtInLastMonth3                               Electronics
## (Intercept)4                                         Fashion
## discountPercentage_log_normalized4                   Fashion
## price_log_normalized4                                Fashion
## reviews_log_normalized4                              Fashion
## titleLength_normalized4                              Fashion
## stars_normalized4                                    Fashion
## boughtInLastMonth4                                   Fashion
## (Intercept)5                               Sports & Outdoors
## discountPercentage_log_normalized5         Sports & Outdoors
## price_log_normalized5                      Sports & Outdoors
## reviews_log_normalized5                    Sports & Outdoors
## titleLength_normalized5                    Sports & Outdoors
## stars_normalized5                          Sports & Outdoors
## boughtInLastMonth5                         Sports & Outdoors
##                                                                 Term
## (Intercept)                                              (Intercept)
## discountPercentage_log_normalized  discountPercentage_log_normalized
## price_log_normalized                            price_log_normalized
## reviews_log_normalized                        reviews_log_normalized
## titleLength_normalized                        titleLength_normalized
## stars_normalized                                    stars_normalized
## boughtInLastMonth                                  boughtInLastMonth
## (Intercept)1                                             (Intercept)
## discountPercentage_log_normalized1 discountPercentage_log_normalized
## price_log_normalized1                           price_log_normalized
## reviews_log_normalized1                       reviews_log_normalized
## titleLength_normalized1                       titleLength_normalized
## stars_normalized1                                   stars_normalized
## boughtInLastMonth1                                 boughtInLastMonth
## (Intercept)2                                             (Intercept)
## discountPercentage_log_normalized2 discountPercentage_log_normalized
## price_log_normalized2                           price_log_normalized
## reviews_log_normalized2                       reviews_log_normalized
## titleLength_normalized2                       titleLength_normalized
## stars_normalized2                                   stars_normalized
## boughtInLastMonth2                                 boughtInLastMonth
## (Intercept)3                                             (Intercept)
## discountPercentage_log_normalized3 discountPercentage_log_normalized
## price_log_normalized3                           price_log_normalized
## reviews_log_normalized3                       reviews_log_normalized
## titleLength_normalized3                       titleLength_normalized
## stars_normalized3                                   stars_normalized
## boughtInLastMonth3                                 boughtInLastMonth
## (Intercept)4                                             (Intercept)
## discountPercentage_log_normalized4 discountPercentage_log_normalized
## price_log_normalized4                           price_log_normalized
## reviews_log_normalized4                       reviews_log_normalized
## titleLength_normalized4                       titleLength_normalized
## stars_normalized4                                   stars_normalized
## boughtInLastMonth4                                 boughtInLastMonth
## (Intercept)5                                             (Intercept)
## discountPercentage_log_normalized5 discountPercentage_log_normalized
## price_log_normalized5                           price_log_normalized
## reviews_log_normalized5                       reviews_log_normalized
## titleLength_normalized5                       titleLength_normalized
## stars_normalized5                                   stars_normalized
## boughtInLastMonth5                                 boughtInLastMonth
##                                         Estimate     Pr.value
## (Intercept)                        -6.361870e+00 9.857150e-36
## discountPercentage_log_normalized   1.451601e+00 1.441405e-08
## price_log_normalized                1.079846e+00 3.268504e-04
## reviews_log_normalized              3.628791e+00 1.888028e-52
## titleLength_normalized              2.618151e-01 2.905183e-01
## stars_normalized                    1.222539e-01 8.125267e-01
## boughtInLastMonth                   6.312080e-04 3.924224e-43
## (Intercept)1                       -6.849721e+00 7.308610e-04
## discountPercentage_log_normalized1  1.340927e+00 1.763222e-01
## price_log_normalized1               1.763517e+00 1.212110e-01
## reviews_log_normalized1             1.684935e-01 8.660714e-01
## titleLength_normalized1             1.160666e+00 2.852809e-01
## stars_normalized1                   2.373937e+00 2.581895e-01
## boughtInLastMonth1                  4.601374e-03 8.933543e-10
## (Intercept)2                       -6.962172e+00 5.874870e-03
## discountPercentage_log_normalized2  3.181797e+00 8.577209e-03
## price_log_normalized2              -1.422806e+00 5.287536e-01
## reviews_log_normalized2             2.603199e+00 2.104350e-02
## titleLength_normalized2             2.949509e-01 8.681321e-01
## stars_normalized2                   2.320032e+00 3.448242e-01
## boughtInLastMonth2                  6.578427e-04 1.652924e-01
## (Intercept)3                        7.611899e-02 9.777748e-01
## discountPercentage_log_normalized3 -1.125735e+00 3.517189e-01
## price_log_normalized3              -2.508848e+00 5.779067e-02
## reviews_log_normalized3             3.430686e+00 3.052442e-03
## titleLength_normalized3            -1.402586e+00 2.693712e-01
## stars_normalized3                  -2.336998e+00 3.785650e-01
## boughtInLastMonth3                  3.702028e-04 1.497172e-01
## (Intercept)4                       -1.094714e+01 1.748673e-04
## discountPercentage_log_normalized4 -7.812276e-02 9.472407e-01
## price_log_normalized4               7.225893e+00 1.630779e-04
## reviews_log_normalized4             1.931477e+00 9.135453e-02
## titleLength_normalized4             1.096688e+00 4.884604e-01
## stars_normalized4                   4.010136e+00 1.576687e-01
## boughtInLastMonth4                  1.434405e-03 2.392216e-04
## (Intercept)5                       -4.156273e+00 1.031595e-03
## discountPercentage_log_normalized5  8.363204e-01 2.077137e-01
## price_log_normalized5               1.941748e+00 2.717914e-02
## reviews_log_normalized5             6.212382e-01 3.163413e-01
## titleLength_normalized5            -4.996661e-01 4.825244e-01
## stars_normalized5                  -5.421061e-02 9.670750e-01
## boughtInLastMonth5                  3.499280e-03 3.063728e-14

Step 4: Plotting the Coefficients of the Logistic Regression Models

Coefficient plots allow us to assess the direction (positive or negative) and magnitude of effects. From these coefficients, we can identify the most important (decisive) variables in each department.
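The coef_plots list is assembled earlier in the analysis; as a rough base-R sketch of the idea (with a small invented subset standing in for all_coefs), a comparable chart for one department might look like this:

```r
# Minimal sketch: a horizontal bar chart of coefficient estimates for one
# department, built from a table shaped like all_coefs (values invented here).
all_coefs_demo <- data.frame(
  Department = "all",
  Term = c("(Intercept)", "reviews_log_normalized", "price_log_normalized"),
  Estimate = c(-6.36, 3.63, 1.08),
  stringsAsFactors = FALSE
)

# Drop the intercept: only predictor effects are of interest in the plot
dept <- subset(all_coefs_demo, Department == "all" & Term != "(Intercept)")

# barplot() returns the bar midpoints (invisibly)
mid <- barplot(dept$Estimate, names.arg = dept$Term, horiz = TRUE,
               las = 1, main = "Coefficients: all departments")
```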

# Looping through each department and printing its coefficient plot
for (department_name in names(coef_plots)) {
  plot <- coef_plots[[department_name]]
  print(plot)
}

Step 5: Displaying Z Values for All Logistic Regression Models

The z value is calculated by dividing the regression coefficient (the estimate) by its standard error. It tests whether the coefficient differs from zero: a large positive z value indicates a coefficient well above zero (a positive association), a large negative z value indicates a coefficient well below zero, and the p-value Pr(>|z|) follows from comparing |z| against the standard normal distribution.
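As a sanity check on this definition, the z value and its two-sided p-value can be reproduced by hand from any fitted glm; the toy data below are invented for illustration.

```r
# Minimal sketch (synthetic data): the z value reported by summary() is
# Estimate / Std. Error, and Pr(>|z|) = 2 * pnorm(-|z|).
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial)

tab <- summary(fit)$coefficients
z_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
p_by_hand <- 2 * pnorm(-abs(z_by_hand))

all.equal(unname(z_by_hand), unname(tab[, "z value"]))   # TRUE
all.equal(unname(p_by_hand), unname(tab[, "Pr(>|z|)"]))  # TRUE
```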

filtered_all_coefs1 <- all_coefs[all_coefs$Department %in% c("all", "Automotive", "Clothing, Shoes & Jewelry","Electronics","Fashion","Sports & Outdoors"), ]

# Reshaping the data frame to a wide format
wide_df1 <- reshape(filtered_all_coefs1,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Pr.value")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Estimate) are really varying
# Note: reshape() prefixes each new column name with the value name
# ("Pr.value."); those prefixed names are kept as-is below

# Setting the row names to the Department names 
rownames(wide_df1) <- wide_df1$Department
wide_df1$Department <- NULL              # Remove the now redundant Department column
wide_df1$"Pr.value.(Intercept)" <- NULL  # Drop the intercept column (not a predictor)
wide_df1$Estimate <- NULL                # Drop the leftover Estimate column

# Showing the p-values of each predictor by department
wide_df1
##                           Pr.value.discountPercentage_log_normalized
## all                                                     1.441405e-08
## Automotive                                              1.763222e-01
## Clothing, Shoes & Jewelry                               8.577209e-03
## Electronics                                             3.517189e-01
## Fashion                                                 9.472407e-01
## Sports & Outdoors                                       2.077137e-01
##                           Pr.value.price_log_normalized
## all                                        0.0003268504
## Automotive                                 0.1212109671
## Clothing, Shoes & Jewelry                  0.5287535560
## Electronics                                0.0577906673
## Fashion                                    0.0001630779
## Sports & Outdoors                          0.0271791352
##                           Pr.value.reviews_log_normalized
## all                                          1.888028e-52
## Automotive                                   8.660714e-01
## Clothing, Shoes & Jewelry                    2.104350e-02
## Electronics                                  3.052442e-03
## Fashion                                      9.135453e-02
## Sports & Outdoors                            3.163413e-01
##                           Pr.value.titleLength_normalized
## all                                             0.2905183
## Automotive                                      0.2852809
## Clothing, Shoes & Jewelry                       0.8681321
## Electronics                                     0.2693712
## Fashion                                         0.4884604
## Sports & Outdoors                               0.4825244
##                           Pr.value.stars_normalized Pr.value.boughtInLastMonth
## all                                       0.8125267               3.924224e-43
## Automotive                                0.2581895               8.933543e-10
## Clothing, Shoes & Jewelry                 0.3448242               1.652924e-01
## Electronics                               0.3785650               1.497172e-01
## Fashion                                   0.1576687               2.392216e-04
## Sports & Outdoors                         0.9670750               3.063728e-14
filtered_all_coefs2 <- all_coefs[all_coefs$Department %in% c("all", "Automotive", "Clothing, Shoes & Jewelry","Electronics","Fashion","Sports & Outdoors"), ]

# Reshaping the data frame to a wide format
wide_df2 <- reshape(filtered_all_coefs2,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Estimate")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Pr.value) are really varying
# Note: reshape() prefixes each new column name with the value name
# ("Estimate."); those prefixed names are kept as-is below

# Setting the row names to the Department names 
rownames(wide_df2) <- wide_df2$Department
wide_df2$Department <- NULL              # Remove the now redundant Department column
wide_df2$"Estimate.(Intercept)" <- NULL  # Drop the intercept column (not a predictor)
wide_df2$"Pr.value" <- NULL              # Drop the leftover Pr.value column

# Showing the coefficient estimates of each predictor by department
wide_df2
##                           Estimate.discountPercentage_log_normalized
## all                                                       1.45160120
## Automotive                                                1.34092732
## Clothing, Shoes & Jewelry                                 3.18179651
## Electronics                                              -1.12573547
## Fashion                                                  -0.07812276
## Sports & Outdoors                                         0.83632041
##                           Estimate.price_log_normalized
## all                                            1.079846
## Automotive                                     1.763517
## Clothing, Shoes & Jewelry                     -1.422806
## Electronics                                   -2.508848
## Fashion                                        7.225893
## Sports & Outdoors                              1.941748
##                           Estimate.reviews_log_normalized
## all                                             3.6287906
## Automotive                                      0.1684935
## Clothing, Shoes & Jewelry                       2.6031993
## Electronics                                     3.4306855
## Fashion                                         1.9314766
## Sports & Outdoors                               0.6212382
##                           Estimate.titleLength_normalized
## all                                             0.2618151
## Automotive                                      1.1606663
## Clothing, Shoes & Jewelry                       0.2949509
## Electronics                                    -1.4025858
## Fashion                                         1.0966877
## Sports & Outdoors                              -0.4996661
##                           Estimate.stars_normalized Estimate.boughtInLastMonth
## all                                      0.12225389               0.0006312080
## Automotive                               2.37393712               0.0046013741
## Clothing, Shoes & Jewelry                2.32003209               0.0006578427
## Electronics                             -2.33699805               0.0003702028
## Fashion                                  4.01013609               0.0014344048
## Sports & Outdoors                       -0.05421061               0.0034992797

These estimates and p-values are discussed further in Part 4.1.

Calculating absolute z-scores

While a single z-score does not directly indicate variable importance, comparing scores across predictors gives a sense of their relative impact: larger absolute values suggest stronger effects. Note that the score computed below standardizes the coefficient estimates against each other (via scale()), rather than reusing the Wald z value from the model summaries.
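As a small illustration of what scale() (used in the next chunk) returns when applied to a coefficient vector: it centers the values on their own mean and divides by their standard deviation, so the absolute result measures each value's distance from the mean in standard deviations. The numbers here are invented.

```r
# Minimal sketch (invented numbers): scale() centers and rescales a vector,
# so abs(scale(v)) shows how far each value sits from the vector's own mean,
# measured in standard deviations.
v <- c(a = 2, b = -1, c = 5, d = 0)
z <- abs(scale(v, center = TRUE, scale = TRUE))
drop(z)   # named vector of absolute standardized values; "c" stands out most
```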

z_scores <- list()
for (department in names(models)) {
  # Getting the coefficients from the fitted model, dropping the intercept
  coefficients <- coef(models[[department]])
  
  # Standardizing the coefficients against each other and taking absolute values
  z_scores[[department]] <- abs(scale(coefficients[-1], center = TRUE, scale = TRUE))
}

z_score_plots <- list()

# Looping through each department's z-scores
for (department in names(z_scores)) {
  variables <- names(models[[department]]$coefficients)[-1]
  
  # Creating a data frame with variable names and absolute z-scores
  z_scores_df <- data.frame(
    Variable = variables,
    Abs_Z_Score = unname(z_scores[[department]])
  )
  
  # Creating a bar plot for absolute z-scores
  z_score_plot <- ggplot(z_scores_df, aes(x = reorder(Variable, Abs_Z_Score), y = Abs_Z_Score)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    labs(title = paste("Absolute Z-Scores of Coefficients for", department),
         x = "Variables",
         y = "Absolute Z-Score") +
    theme_minimal() +
    coord_flip()

  z_score_plots[[department]] <- z_score_plot
}
# Displaying the plots
for (plot in z_score_plots) {
  print(plot)
}

Step 6: Plotting the Partial Effect

Furthermore, alongside the coefficients in the previous section, the partial effect measures how a single predictor variable influences the expected probability of the positive outcome in a logistic regression model. It quantifies the change in the predicted probability of the outcome for a unit change in the predictor variable while keeping the other variables constant.
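Before calling allEffects() from the effects package, the underlying idea can be sketched in base R: vary one predictor over a grid while holding the other at its mean, and track the predicted probability. The data and variable names below are synthetic, invented for this sketch.

```r
# Minimal base-R sketch of a partial effect (synthetic data):
# vary x1 over a grid, hold x2 at its mean, and compute the
# predicted probability of the positive class at each grid point.
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1 - 0.8 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

grid <- data.frame(
  x1 = seq(min(x1), max(x1), length.out = 50),
  x2 = mean(x2)   # held constant at its mean
)
grid$p_hat <- predict(fit, newdata = grid, type = "response")

# With a positive x1 coefficient, p_hat rises monotonically along the grid
head(grid)
```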

create_and_display_plots <- function(department, model) {
  # First, we need to compute the partial effects
  partial_effects <- allEffects(model)
  
  # Create ggplot objects for each effect
  plots <- lapply(partial_effects, function(effect) {
    effect_df <- as.data.frame(effect)
    p <- ggplot(effect_df, aes(x = eval(as.name(names(effect_df)[1])), y = fit)) +
      geom_line() +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
      labs(title = paste(names(effect_df)[1]), x = names(effect_df)[1], y = "Effect") +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 8),    # Adjust size: plot titles
        axis.title = element_text(size = 6),    # Adjust size: axis titles
        axis.text.x = element_text(size = 8),    # Adjust size: x-axis text
        axis.text.y = element_text(size = 8)     # Adjust size: y-axis text
      )
    return(p)
  })

  # Combine plots into a grid
  plot_grid <- marrangeGrob(plots, nrow = 2, ncol = 3, top = department)

  # Print the plot grid explicitly
  grid::grid.draw(plot_grid)
}
# Loop through each department model and display plots
for (department in names(models)) {
  model <- models[[department]]
  create_and_display_plots(department, model)
}

Step 7: Calculating Model Evaluation

To evaluate the logistic regression models, the ROC curve and AUC will be used.

- ROC Curve: a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different threshold values. The ROC curve visualizes the trade-off between sensitivity and specificity.
- AUC (Area Under the ROC Curve): quantifies the overall performance of the classifier across all possible thresholds. A higher AUC value indicates better performance of the model in distinguishing between the two classes.

In this research, the ROC curve is plotted for each model, and the AUC is obtained as the area under that curve.
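The AUC also has a direct probabilistic reading: it equals the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative one. A base-R sketch with invented scores checks this via the rank (Mann-Whitney) formula, which is equivalent to the area computed from an ROC curve such as the one pROC produces.

```r
# Minimal sketch (synthetic scores): AUC via the rank (Mann-Whitney) formula.
set.seed(3)
scores_pos <- rnorm(100, mean = 1)   # scores for positive instances
scores_neg <- rnorm(100, mean = 0)   # scores for negative instances

r <- rank(c(scores_pos, scores_neg))           # joint ranks (ties averaged)
n_pos <- length(scores_pos)
n_neg <- length(scores_neg)

# AUC = P(score of a random positive > score of a random negative)
auc <- (sum(r[1:n_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc   # well above 0.5: positives tend to outscore negatives
```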

model_evaluations <- list()

predicted_probabilities <- predict(models[["all"]], df_test_all, type = "response")
actual_labels_all <- df_test_all$isBestSeller

roc_curve <- roc(actual_labels_all, predicted_probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
model_evaluations[["all"]] <- list(
AUC = auc_value,
PredictedProbabilities = predicted_probabilities,
ActualLabels = actual_labels_all
)

# Looping through each department's test dataset
for (department in names(test_dfs)) {
  # Getting the test data frame for the current department
  df_test <- test_dfs[[department]]
  
  # Getting the logistic regression model for the current department
  model_logistic <- models[[department]]
  
  # Predicting probabilities of being a best seller using the model
  predicted_probabilities <- predict(model_logistic, df_test, type = "response")
  
  # Evaluating the model's performance
  actual_labels <- df_test$isBestSeller  
  
  # Explicitly setting factor levels
  actual_labels <- factor(actual_labels, levels = c(0, 1))
  
  roc_curve <- roc(actual_labels, predicted_probabilities)
  auc_value <- auc(roc_curve)
  
  # Storing the model evaluation results
  model_evaluations[[department]] <- list(
    AUC = auc_value,
    PredictedProbabilities = predicted_probabilities,
    ActualLabels = actual_labels
  )
}
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Step 8: Plotting ROC Curve for Further Improvement

The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for different thresholds in a binary classification model. It helps us understand how well the model distinguishes between positive and negative instances across various decision boundaries. The AUC is a summary measure of the ROC curve: it quantifies the model's overall performance as the area under that curve. An AUC of 0.5 corresponds to random guessing, while an AUC of 1.0 indicates perfect classification.

From the ROC curves of the all-department model and the top five departments, only the Electronics department has an AUC value below 0.5, making the Electronics model the worst of the six.

# Width and height for the ROC plot
width <- 15
height <- 15
# Creating an empty list to store ROC curve plots for each department
roc_plots <- list()
# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  # Calculating the ROC curve
  roc_curve <- roc(actual_labels, predicted_probs)
  # Getting the AUC value from model_evaluations list
  auc_value <- model_evaluations[[department]]$AUC
  # Plotting the ROC curve with AUC value in the title and adjusted size
  roc_plot <- plot(roc_curve, main = paste("ROC Curve for", department, "(AUC =", round(auc_value, 2), ")"), width = width, height = height, cex.main = 0.8)
  roc_plots[[department]] <- roc_plot
}
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The graphs above show that all of the models (the all-department model and the five department-specific models) achieve relatively high AUC values (>0.5), except for the Electronics department. The strongest models are the all-department, Fashion, and Sports & Outdoors models. Since none of the AUC values is very small (close to zero, <0.3), we can proceed to evaluate the confusion matrices, comparing predicted values against true values.

Step 9: Recalculating Evaluation Metrics

Accuracy measures the proportion of correctly predicted instances (both true positives and true negatives) out of all instances. While accuracy is straightforward, it can be misleading in imbalanced datasets where one class dominates; in such cases, accuracy alone may not provide a complete picture.

Precision quantifies the proportion of true positive predictions out of all positive predictions (true positives plus false positives). High precision indicates that when the model predicts a positive instance, it is likely to be correct. It is crucial in scenarios where false positives are costly.

Recall measures the proportion of true positive predictions out of all actual positive instances (true positives plus false negatives). High recall indicates that the model captures most of the actual positive instances. It is essential when missing positive cases (false negatives) is critical (e.g., disease diagnosis).

The F1 score is the harmonic mean of precision and recall, combining both metrics into a single number that accounts for both false positives and false negatives. It is useful when you want to strike a balance between precision and recall.
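As a quick worked example with toy confusion-matrix counts (hypothetical numbers, not the project data), the four metrics follow directly from the four cells:

```r
# Toy confusion-matrix counts (for illustration only)
TP <- 8; FN <- 2; FP <- 4; TN <- 86

accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # (8 + 86) / 100 = 0.94
precision <- TP / (TP + FP)                    # 8 / 12 ~ 0.6667
recall    <- TP / (TP + FN)                    # 8 / 10 = 0.8
f1        <- 2 * precision * recall / (precision + recall)  # 16/22 ~ 0.7273

round(c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1), 4)
```

Note how the high accuracy (0.94) is driven mostly by the dominant negative class, while precision and recall expose the model's weaker performance on the minority positive class.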

metrics <- list()
evaluation_results <- list()

# Defining a function to calculate the confusion matrix
calculate_confusion_matrix <- function(predicted_labels, actual_labels) {
  TP_A <- sum(predicted_labels == 1 & actual_labels == 1)
  TN_A <- sum(predicted_labels == 0 & actual_labels == 0)
  FP_A <- sum(predicted_labels == 1 & actual_labels == 0)
  FN_A <- sum(predicted_labels == 0 & actual_labels == 1)
  
  confusion_matrix <- matrix(c(TP_A, FN_A, FP_A, TN_A), nrow = 2, byrow = TRUE,
                             dimnames = list(c("Actual Positive (1)", "Actual Negative (0)"),
                                             c("Predicted Positive", "Predicted Negative")))
  return(confusion_matrix)
}

# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on a threshold (e.g., 0.5 for binary classification)
  predicted_labels <- ifelse(predicted_probs >= 0.5, 1, 0)
  
  # Creating the confusion matrix
  conf_matrix_A <- calculate_confusion_matrix(predicted_labels, actual_labels)
  
  # Calculating evaluation metrics
  # (rows are actual labels, columns are predicted labels, so
  #  precision = TP / (TP + FP) uses column 1 and recall = TP / (TP + FN) uses row 1)
  accuracy_A <- sum(diag(conf_matrix_A)) / sum(conf_matrix_A)
  precision_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[, 1])
  recall_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[1, ])
  f1_score_A <- 2 * (precision_A * recall_A) / (precision_A + recall_A)
  
  # Storing the evaluation metrics and confusion matrix in the metrics list
  metrics[[department]] <- list(
    Accuracy = accuracy_A,
    Precision = precision_A,
    Recall = recall_A,
    F1_Score = f1_score_A,
    Confusion_Matrix = conf_matrix_A
  )
}
for (department in names(metrics)) {
  cat("Department:", department, "\n")
  cat("Accuracy:", metrics[[department]]$Accuracy, "\n")
  cat("Precision:", metrics[[department]]$Precision, "\n")
  cat("Recall:", metrics[[department]]$Recall, "\n")
  cat("F1 Score:", metrics[[department]]$F1_Score, "\n")
  cat("Confusion Matrix:\n")
  print(metrics[[department]]$Confusion_Matrix)
  cat("\n")
}
## Department: all 
## Accuracy: 0.9392842 
## Precision: 0.4285714 
## Recall: 0.02891566 
## F1 Score: 0.05417607 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 12                403
## Actual Negative (0)                 16               6470
## 
## Department: Automotive 
## Accuracy: 0.9094203 
## Precision: 0.375 
## Recall: 0.1304348 
## F1 Score: 0.1935484 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  3                 20
## Actual Negative (0)                  5                248
## 
## Department: Clothing, Shoes & Jewelry 
## Accuracy: 0.8871795 
## Precision: NaN 
## Recall: 0 
## F1 Score: NaN 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  0                 22
## Actual Negative (0)                  0                173
## 
## Department: Electronics 
## Accuracy: 0.8778281 
## Precision: NaN 
## Recall: 0 
## F1 Score: NaN 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  0                 27
## Actual Negative (0)                  0                194
## 
## Department: Fashion 
## Accuracy: 0.917603 
## Precision: 1 
## Recall: 0.04347826 
## F1 Score: 0.08333333 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  1                 22
## Actual Negative (0)                  0                244
## 
## Department: Sports & Outdoors 
## Accuracy: 0.9097606 
## Precision: 0.6666667 
## Recall: 0.1509434 
## F1 Score: 0.2461538 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  8                 45
## Actual Negative (0)                  4                486

Calculating the optimal threshold

Using the ROC curve to find a threshold that improves recall, then recalculating the evaluation metrics and rebuilding the confusion matrices.

After examining the confusion matrices, we found considerable inaccuracy, especially Type II errors (isBestSeller = "yes" predicted as "no"). We can still improve model performance by defining an optimal threshold for the logistic regression models. The idea is to use the ROC curve to find the output-probability threshold that maximizes recall performance and prevents Type II errors. In other words, we should not simply assume for every model that a predicted probability above 0.5 means isBestSeller = "yes".

Because false negatives (Type II errors) are critical and costly in this business context, we prioritize recall. A false negative means a product that should be a Best Seller is predicted not to be, which harms the benefits to our clients.
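To make the threshold search concrete, here is a minimal base-R sketch on simulated imbalanced data (made-up probabilities, not the project data) that scans candidate thresholds and picks the one maximizing Youden's J (sensitivity + specificity - 1); on imbalanced data this cut-off typically lands well below the default 0.5, which raises recall:

```r
set.seed(42)
# Simulated imbalanced labels (~10% positives) with loosely informative scores
actual <- rbinom(1000, 1, 0.1)
probs  <- ifelse(actual == 1, rbeta(1000, 2, 6), rbeta(1000, 1, 9))

# Scan every observed score as a candidate threshold
candidate_thresholds <- sort(unique(probs))
youden <- sapply(candidate_thresholds, function(t) {
  pred <- as.integer(probs >= t)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # recall
  spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)
  sens + spec - 1
})
optimal <- candidate_thresholds[which.max(youden)]
optimal  # well below 0.5 for this imbalanced example
```

The pROC `coords(roc, "best", ...)` call used below performs essentially this search on the fitted ROC curve rather than on raw scores.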

# Looping through each department's ROC curve plots
for (department in names(roc_plots)) {
  # Get the ROC curve data from the roc_plots list
  roc_curve <- roc_plots[[department]]
  
  # Finding the "best" threshold via Youden's J (balancing sensitivity and
  # specificity), which lowers the cut-off well below 0.5 and improves recall
  optimal_threshold <- coords(roc_curve, "best", ret = "threshold", best.method = "youden")$threshold
  
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on the optimal threshold
  predicted_labels <- ifelse(predicted_probs >= optimal_threshold, 1, 0)
  
  # Creating confusion matrix
  TP <- sum(predicted_labels == 1 & actual_labels == 1)
  FN <- sum(predicted_labels == 0 & actual_labels == 1)
  FP <- sum(predicted_labels == 1 & actual_labels == 0)
  TN <- sum(predicted_labels == 0 & actual_labels == 0)
  
  # Creating confusion matrix with labels
  confusion_matrix <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE)
  colnames(confusion_matrix) <- c("Predicted Positive", "Predicted Negative")
  rownames(confusion_matrix) <- c("Actual Positive (1)", "Actual Negative (0)")
  
  # Calculating evaluation metrics
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  
  # Storing the evaluation results in the evaluation_results list
  evaluation_results[[department]] <- list(
    Threshold = optimal_threshold,
    ConfusionMatrix = confusion_matrix,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score
  )
}
# Displaying the evaluation results for each department
for (department in names(evaluation_results)) {
  cat("Department:", department, "\n")
  cat("Optimal Threshold:", evaluation_results[[department]]$Threshold, "\n")
  cat("Confusion Matrix:\n")
  print(evaluation_results[[department]]$ConfusionMatrix)
  cat("Accuracy:", evaluation_results[[department]]$Accuracy, "\n")
  cat("Precision:", evaluation_results[[department]]$Precision, "\n")
  cat("Recall:", evaluation_results[[department]]$Recall, "\n")
  cat("F1 Score:", evaluation_results[[department]]$F1_Score, "\n\n")
}
## Department: all 
## Optimal Threshold: 0.07047192 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                246                169
## Actual Negative (0)               1642               4844
## Accuracy: 0.7375743 
## Precision: 0.1302966 
## Recall: 0.5927711 
## F1 Score: 0.2136344 
## 
## Department: Automotive 
## Optimal Threshold: 0.09968891 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 15                  8
## Actual Negative (0)                 83                170
## Accuracy: 0.6702899 
## Precision: 0.1530612 
## Recall: 0.6521739 
## F1 Score: 0.2479339 
## 
## Department: Clothing, Shoes & Jewelry 
## Optimal Threshold: 0.1115103 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 13                  9
## Actual Negative (0)                 54                119
## Accuracy: 0.6769231 
## Precision: 0.1940299 
## Recall: 0.5909091 
## F1 Score: 0.2921348 
## 
## Department: Electronics 
## Optimal Threshold: 0.0807731 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 13                 14
## Actual Negative (0)                 98                 96
## Accuracy: 0.4932127 
## Precision: 0.1171171 
## Recall: 0.4814815 
## F1 Score: 0.1884058 
## 
## Department: Fashion 
## Optimal Threshold: 0.04083069 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 20                  3
## Actual Negative (0)                112                132
## Accuracy: 0.5692884 
## Precision: 0.1515152 
## Recall: 0.8695652 
## F1 Score: 0.2580645 
## 
## Department: Sports & Outdoors 
## Optimal Threshold: 0.09590797 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 38                 15
## Actual Negative (0)                160                330
## Accuracy: 0.6777164 
## Precision: 0.1919192 
## Recall: 0.7169811 
## F1 Score: 0.3027888

Displaying the evaluation metrics after using the optimal threshold to maximize recall

Finally, after determining the optimal thresholds, we present the resulting evaluation metrics.

# Creating a data frame from evaluation_results
evaluation_df <- data.frame(
  Optimal_Threshold = sapply(evaluation_results, function(x) x$Threshold),
  Accuracy = sapply(evaluation_results, function(x) x$Accuracy),
  Precision = sapply(evaluation_results, function(x) x$Precision),
  Recall = sapply(evaluation_results, function(x) x$Recall),
  F1_Score = sapply(evaluation_results, function(x) x$F1_Score)
)

# Sorting the data frame by recall in descending order
evaluation_df <- evaluation_df[order(-evaluation_df$Recall),]

# Showing the sorted table
kable(evaluation_df, caption = "Evaluation Results Sorted by Recall (Descending)")
Evaluation Results Sorted by Recall (Descending)

|                           | Optimal_Threshold |  Accuracy | Precision |    Recall |  F1_Score |
|:--------------------------|------------------:|----------:|----------:|----------:|----------:|
| Fashion                   |         0.0408307 | 0.5692884 | 0.1515152 | 0.8695652 | 0.2580645 |
| Sports & Outdoors         |         0.0959080 | 0.6777164 | 0.1919192 | 0.7169811 | 0.3027888 |
| Automotive                |         0.0996889 | 0.6702899 | 0.1530612 | 0.6521739 | 0.2479339 |
| all                       |         0.0704719 | 0.7375743 | 0.1302966 | 0.5927711 | 0.2136344 |
| Clothing, Shoes & Jewelry |         0.1115103 | 0.6769231 | 0.1940299 | 0.5909091 | 0.2921348 |
| Electronics               |         0.0807731 | 0.4932127 | 0.1171171 | 0.4814815 | 0.1884058 |

Part 4. Interpretation

4.1. Result Visualization

1. What are the number of best sellers, total product, and percentage of best seller per department?

# Create the plot for number of bestsellers and number of products together
plot1 <- ggplot(bestseller_counts, aes(x = num_bestsellers, y = reorder(department, num_bestsellers))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label=num_bestsellers), vjust=0.3, hjust=-0.2, color="black")+
  annotate("text", y = "Home & Kitchen", x = 280, label = "  ", hjust = 1, color = "red") +
  labs(x = "Number of Bestsellers", y = "Department", title = "Number of Bestsellers per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

plot1

plot2 <- ggplot(product_counts, aes(x = count, y = reorder(department, count))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label=count), vjust=0.3, hjust=-0.2, color="black")+
  annotate("text", y = "Home & Kitchen", x = 4500, label = "  ", hjust = 1, color = "red") +
  labs(x = "Number of Products", y = "", title = "Number of Products per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

plot2

# Arrange plots 1 and 2 together
#grid.arrange(plot1, plot2, nrow = 1)

# Create the plot for the highest percentage of bestsellers
plot3 <- ggplot(department_counts, aes(x = percentage_best_sellers, y = reorder(department, percentage_best_sellers))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label=sprintf("%.2f%%", percentage_best_sellers)), vjust=0.3, hjust=-0.2, color="black")+
  annotate("text", y = "Sports & Outdoors", x = 13, label = "  ", hjust = 1, color = "red") +
  labs(x = "Percentage of Bestsellers (in %)", y = "", title = "Highest Percentage of Bestsellers per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5))

# Display plot3 separately
plot3

2. What are the general insights, specifically for the top 5 departments, based on the highest percentage of best-seller products?

department_counts$color <- ifelse(row.names(department_counts) %in% head(row.names(department_counts), 5), "orange", "grey")

# Now create the plot
plot4 <- ggplot(department_counts, aes(x = percentage_best_sellers, y = reorder(department, percentage_best_sellers), fill = color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("orange" = "orange", "grey" = "grey")) +
  geom_text(aes(label = sprintf("%.2f",percentage_best_sellers)), vjust = 0.3, hjust = -0.2, color = "black") +
  annotate("text", x = 15, y = "Sports & Outdoors", label = " ", hjust = 1, color = "red")+
  labs(x = "Percentage of Bestsellers (in %)", y = " ", title = "Highest Percentage of Bestsellers per Department") +
  theme(plot.title = element_text(size = rel(0.8), hjust = 0.5),legend.position="none")

# Display the plot
plot4

The bar chart shows the distribution of the best-seller percentage across all departments. The top four departments show a similar best-seller ratio, ranging from 10.94% down to 9.78% (Rank #1 and Rank #4 differ by only 1.16 percentage points). However, the margin between Rank #4 (9.78%) and Rank #5 (7.55%) is 2.23 percentage points, so Rank #5 can be considered to belong to a different "priority group".

3. What are the general factors that affect whether a product becomes a best seller on Amazon, and which factors significantly influence best-seller status within the top 5 departments (ranked by percentage of best-seller products)?

Result from Logistic Regression:

wide_df1
##                           Pr.value.discountPercentage_log_normalized
## all                                                     1.441405e-08
## Automotive                                              1.763222e-01
## Clothing, Shoes & Jewelry                               8.577209e-03
## Electronics                                             3.517189e-01
## Fashion                                                 9.472407e-01
## Sports & Outdoors                                       2.077137e-01
##                           Pr.value.price_log_normalized
## all                                        0.0003268504
## Automotive                                 0.1212109671
## Clothing, Shoes & Jewelry                  0.5287535560
## Electronics                                0.0577906673
## Fashion                                    0.0001630779
## Sports & Outdoors                          0.0271791352
##                           Pr.value.reviews_log_normalized
## all                                          1.888028e-52
## Automotive                                   8.660714e-01
## Clothing, Shoes & Jewelry                    2.104350e-02
## Electronics                                  3.052442e-03
## Fashion                                      9.135453e-02
## Sports & Outdoors                            3.163413e-01
##                           Pr.value.titleLength_normalized
## all                                             0.2905183
## Automotive                                      0.2852809
## Clothing, Shoes & Jewelry                       0.8681321
## Electronics                                     0.2693712
## Fashion                                         0.4884604
## Sports & Outdoors                               0.4825244
##                           Pr.value.stars_normalized Pr.value.boughtInLastMonth
## all                                       0.8125267               3.924224e-43
## Automotive                                0.2581895               8.933543e-10
## Clothing, Shoes & Jewelry                 0.3448242               1.652924e-01
## Electronics                               0.3785650               1.497172e-01
## Fashion                                   0.1576687               2.392216e-04
## Sports & Outdoors                         0.9670750               3.063728e-14
wide_df2
##                           Estimate.discountPercentage_log_normalized
## all                                                       1.45160120
## Automotive                                                1.34092732
## Clothing, Shoes & Jewelry                                 3.18179651
## Electronics                                              -1.12573547
## Fashion                                                  -0.07812276
## Sports & Outdoors                                         0.83632041
##                           Estimate.price_log_normalized
## all                                            1.079846
## Automotive                                     1.763517
## Clothing, Shoes & Jewelry                     -1.422806
## Electronics                                   -2.508848
## Fashion                                        7.225893
## Sports & Outdoors                              1.941748
##                           Estimate.reviews_log_normalized
## all                                             3.6287906
## Automotive                                      0.1684935
## Clothing, Shoes & Jewelry                       2.6031993
## Electronics                                     3.4306855
## Fashion                                         1.9314766
## Sports & Outdoors                               0.6212382
##                           Estimate.titleLength_normalized
## all                                             0.2618151
## Automotive                                      1.1606663
## Clothing, Shoes & Jewelry                       0.2949509
## Electronics                                    -1.4025858
## Fashion                                         1.0966877
## Sports & Outdoors                              -0.4996661
##                           Estimate.stars_normalized Estimate.boughtInLastMonth
## all                                      0.12225389               0.0006312080
## Automotive                               2.37393712               0.0046013741
## Clothing, Shoes & Jewelry                2.32003209               0.0006578427
## Electronics                             -2.33699805               0.0003702028
## Fashion                                  4.01013609               0.0014344048
## Sports & Outdoors                       -0.05421061               0.0034992797

As shown in the tables above, by looking at the coefficient estimates (Estimate) and their significance levels (Pr.value, the p-values), the general findings suggest that more reviews and more recent purchases strongly influence whether a product becomes a best seller. Additionally, price and discount levels moderately influence best-seller status.

Consequently, the top one to three most significant influencing factors for each department are:

  • Automotive: Recent purchase
  • Clothing, Shoes & Jewelry: Discount, Reviews, Recent purchase (moderate)
  • Electronics: Reviews, Recent purchase (moderate)
  • Fashion: Price, Recent purchase, Reviews (moderate)
  • Sports & Outdoors: Recent purchase, Price (moderate)
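To help read the Estimate table above, a logistic regression coefficient can be converted to an odds ratio by exponentiation; for example, taking the reviews coefficient for the all-department model from wide_df2 (3.6287906, on the normalized log-reviews scale):

```r
# Logistic regression coefficients live on the log-odds scale, so
# exp(coefficient) gives the multiplicative change in the odds of
# isBestSeller = "yes" per one-unit increase in the (normalized) predictor
estimate_reviews_all <- 3.6287906   # from wide_df2 above

odds_ratio <- exp(estimate_reviews_all)
round(odds_ratio, 2)  # ~37.67
```

Since the predictor is normalized to a unit scale, a one-unit increase spans roughly the full observed range, so the odds of being a best seller multiply by roughly 38 across that range, which is why reviews dominate the all-department model.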

Intersection of variable significance between the two models: Random Forest significance = RF; Logistic Regression significance = Log. Each cell shows (RF | Log), where 1 = significant and 0 = not significant.

Department    Discount Percentage   Price     Reviews   Title Length   Stars     Recent Purchase
All           (0 | 1)               (0 | 1)   (0 | 1)   (0 | 0)        (0 | 0)   (1 | 1)
Automotive    (0 | 0)               (0 | 0)   (0 | 0)   (1 | 0)        (1 | 0)   (1 | 1)
Clothing      (0 | 1)               (1 | 0)   (1 | 1)   (0 | 0)        (0 | 0)   (1 | 0)
Electronics   (1 | 0)               (0 | 0)   (1 | 1)   (0 | 0)        (1 | 0)   (0 | 0)
Fashion       (0 | 0)               (1 | 1)   (1 | 1)   (0 | 0)        (0 | 0)   (1 | 0)
Sports        (0 | 0)               (1 | 1)   (1 | 0)   (1 | 0)        (0 | 0)   (1 | 1)

Final Remarks

In this research, two appropriate modeling techniques were applied: classification (Random Forest) and logistic regression. Both were evaluated using confusion matrices, and the most important variables were extracted for each of the five departments with the highest probability of producing best sellers. Of the two methods, Random Forest classification yields the higher accuracy. However, we still compare the two, and for our recommendations we choose the variables that are important in both models. For example, Random Forest may indicate that department X's most important variables are Reviews and Price, while logistic regression indicates that department X's most important variables are Price and boughtInLastMonth. We will not rely solely on Random Forest (and solely recommend that our sellers focus on Reviews and Price). The reasons are:

  1. Consensus: Despite differences in which variables each method identifies as key, where both Random Forest and logistic regression agree, they reinforce the importance of those variables in influencing the outcome of interest, whether sales performance, customer satisfaction, or another metric under consideration.

  2. By considering insights from multiple modeling techniques, we gain a more comprehensive understanding of the underlying relationships within the data.

  3. Adapt to Different Scenarios: Different modeling techniques may perform differently depending on the characteristics of the dataset and the specific problem at hand. By incorporating insights from multiple methods, we increase the adaptability of our recommendations to different scenarios and datasets, thereby improving their generalizability and applicability.

  4. We have not found an explanation for why our Random Forest yields such a good result using the black-box Random Forest library. We hypothesize that it achieves >99% accuracy (almost ideal) because similar machine learning is used by Amazon itself; however, we have no connection to Amazon's data team to prove this. To be safe, we use both methods rather than relying on only one.

4.2. Recommendations

Based on our findings and focusing on the business problems we want to solve, we derive the two best immediate strategic actions, explained further below:

Business problem 1: In Amazon e-commerce Canada, What products are recommended to sell?

Firstly, when our client is a new vendor that wants to achieve "Best Seller" products, we recommend focusing on the top four departments (namely Sports & Outdoors, Automotive, Clothing, Shoes & Jewelry, and Electronics), which have the highest percentages of best-seller products (see Section 4.1 #2 for more details). By allocating resources to these departments and learning from the strategies employed within them, vendors can increase the likelihood of producing best-seller products.

Business problem 2: How to be Best Sellers in these Departments?

Based on Section 4.1, research question 3 (the intersection of the Random Forest classification and logistic regression results), companies showcasing their products on Amazon should prioritize their effort and resources on the influencing factors relevant to each department, as identified by both methods, in order for their products to become best sellers.

We recommend that companies focus their strategy on the most influential factors for each department. In general (All): increase the number of recent purchases (to the level of the average for each department). We understand that each department has its own characteristics and that recent purchases could be an effect of the marketing effort (rather than a cause), so we recommend that sellers follow the detail for each department below:

  • Automotive: Aim to sell approximately 300 items per month (derived from the average of boughtInLastMonth for best-seller products in the Automotive department). In contrast, non-best-seller products sold on average only around 100 items per month.
  • Clothing, Shoes & Jewelry: Aim to gain around 7,000 reviews per product (from the very beginning of the product introduction). This can be done by approaching customers through the Amazon chat feature to fill in the review section, and by including a QR code with the delivered product that directs customers to the review section. In contrast, non-best-seller products within this department gain on average only 4,390 reviews.
  • Electronics: Aim to gain around 11,000 reviews per product (from the very beginning of the product introduction). This can also be done using the approaches recommended in the previous point.
  • Fashion: Within this department, Amazon's customers prefer to buy fashion products at a higher price accompanied by a higher number of reviews. Sellers should understand that this does not mean simply raising prices; higher prices must be matched by product quality. Our findings show that, for fashion, people are likely to buy high-quality fashion products.
  • Sports & Outdoors: Within this department, Amazon's customers also prefer to buy sports and outdoors products at a higher price, probably due to brand recognition; people tend to buy well-known sports products.

Recent purchases (boughtInLastMonth) show strong overlap between the two modeling methods. However, it is important to realize that recent purchases are an after-effect of the other five variables. Therefore, the factors we can fully control are the five variables besides boughtInLastMonth, namely discount percentage, price, reviews, stars, and title length, which we treat as independent variables; recent purchases ("boughtInLastMonth") is considered a dependent variable.

4.3. Further Improvement for Better Analytics and Robust Recommendations

Based on insights derived from this data analytics initiative, we present our recommendations as a 4-pillar e-commerce sales optimization process that will guide the consultancy service we want to offer to our clients. To maximize our clients' product performance on the Amazon Canada platform, we recommend they prioritize their efforts across these four areas.

# Function to load an image from a Dropbox shared link
# (uses GET() and content() from the httr package and image_read() from magick)
load_image <- function(dropbox_link) {
  # Modify the Dropbox shared link to obtain the direct link
  direct_link <- gsub("www.dropbox.com", "dl.dropboxusercontent.com", dropbox_link)
  direct_link <- gsub("\\?dl=0$", "", direct_link)
  
  # Fetch image from the direct link
  response <- GET(direct_link)
  
  # Read image as binary
  img_binary <- content(response, "raw")
  
  # Convert binary data to image
  img <- image_read(img_binary)
  
  return(img)
}

# Example Dropbox shared link to an image file
dropbox_link <- "https://www.dropbox.com/scl/fi/9c7dznlbg34yj130c3y97/4-Pillar-Sales-Optimization.jpeg?rlkey=ozgigv9alpp33cw1853otrbp3&st=qp3zlmyn&dl=0"

# Load image from Dropbox shared link
image <- load_image(dropbox_link)

# Display the loaded image
plot(as.raster(image))

Figure 1: Our Recommended 4-pillar Sales Optimization Process

  1. Strategic Category Selection

We recommend that vendors prioritize selling products under the departments with the highest percentages of best-seller products, as mentioned before: "Sports & Outdoors", "Automotive", "Clothing, Shoes & Jewelry", "Fashion", and "Electronics". Furthermore, it is also best to consider other factors, such as the highest sales growth and market trends, that could affect the market dynamics of the product itself; these can be further factors for vendors to consider.

  2. Product Listing Optimization

We would then recommend that vendors ensure their product listings are optimized for Amazon's search ranking through Amazon SEO techniques. They can also experiment with A/B testing to compare different versions of product elements, such as title text, title length, and product photos, to determine which drives more purchases. Using engaging content for appealing branding and visual presentation can further improve product visibility.

  3. Smart Pricing Strategy Development

To achieve profitability, vendors should conduct pricing research to set competitive price points. We therefore recommend that new vendors take a proactive approach to developing different pricing strategies, which help them identify different break-even timelines for their product launch campaigns. The pricing recommendations mentioned before can serve as a reference. Afterwards, they can use the Automate Pricing feature (https://sell.amazon.com/tools/automate-pricing), which adjusts prices dynamically based on market trends and competitor pricing.

  4. Positive Customer Engagement

Customer engagement is crucial for building strong relationships and driving repeat business. We recommend that our clients prioritize quick responses to pre-sale inquiries and follow up after purchases, e.g. a personal approach to obtain reviews, or promotions as a customer retention program. The recommendations mentioned previously can also serve as a reference.

Firstly, when our client is a new vendor focused on launching "Best Seller"-ranked products, we would focus on Pillar 1 of the process first, while the other steps can be addressed afterwards. By allocating resources and learning from the strategies employed by these departments, vendors can increase the likelihood of producing best-seller products. Additionally, vendors competing in these categories should leverage the factors that have shown significant influence on best sellers across departments, such as more reviews, recent purchases, and competitive pricing.

Secondly, for clients that are existing vendors focused on maximizing profitability, we would recommend concentrating on Pillars 2, 3, and 4. We would advise these clients to prioritize consistent quality standards to secure customer loyalty and repeat purchases; consistency in product quality across categories is key to maintaining customer satisfaction and positive reviews. Our data analytics initiative showed that customer reviews and direct engagement with potential and existing customers are a valuable resource for identifying areas where product quality can improve. Therefore, significant focus should be placed on responding first to pre-sale questions, which increases the perception of high engagement on the product listing and entices future customers to leave reviews. Using sentiment analysis, we would work with these clients to analyze customer reviews and feedback to identify areas for improvement in product quality and customer satisfaction.
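As a rough sketch of the sentiment-analysis step, the base-R snippet below scores reviews against tiny hand-made word lists. The word lists and example reviews are invented for illustration; a real implementation would use an established lexicon, for instance via the `syuzhet` or `tidytext` packages:

```r
# Minimal lexicon-based sentiment sketch (hypothetical word lists).
positive <- c("great", "excellent", "love", "fast", "quality")
negative <- c("broken", "slow", "poor", "refund", "disappointed")

# Score = count of positive words minus count of negative words.
score_review <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^a-zA-Z']+")))
  sum(words %in% positive) - sum(words %in% negative)
}

reviews <- c("Great quality, fast shipping",
             "Arrived broken, asked for a refund")
sapply(reviews, score_review)  # one net sentiment score per review
```

Aggregating such scores per product or per department would highlight where negative feedback clusters, pointing to concrete quality or service improvements.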

Analyzing the performance of the best-selling products has provided valuable insights into the strategies used by their vendors. For example, in the Electronics department a vendor can focus on the reviews section, while in the Sports and Outdoors department vendors can focus on price. So, if our clients are selling in poor-performing categories, we recommend that they adopt these strategies from the best-selling categories and avoid the practices that lead to low sales. They should also start allocating resources to customer service: a lack of attention to pre-sales and after-sales enquiries is reflected in the reviews and star ratings their products receive.

4.4. Conclusion & Limitations

Conclusion

Part 1 Project Background discusses the main objectives, aims, and the business problem we want to address. We also hypothesize that products with higher ratings and more reviews are likely to perform better in terms of sales. We define the main dataset that we will use and our strategy for approaching the data to extract valuable insights.

Part 2 Data Exploration and Preparation explores the data to understand it in more depth, especially what each variable contributes. By examining the dataset thoroughly, we identify the challenges and requirements to overcome. Having reliable input is essential, so the methods for producing clean data are executed here, resulting in a dataset that is cleaned and ready for the modelling part. From the feature engineering, we obtain insight into the ranking of best-seller products based on the percentage of best-seller products per department, the number of best-seller products per department, and the number of products per department.

Part 3 Data Modeling uses two modelling techniques to obtain the analytics needed for gaining insights. We use classification with decision tree methods and logistic regression to identify the most influential factors affecting whether a product becomes a best seller.

Part 4 Interpretation gives the summary and actionable recommendations for users who want to sell products on Amazon. The business problems are answered in section 4.2 Recommendations. The hypothesis that we defined in Part 1 holds only partially, for certain departments: in most departments, products become best sellers because of a higher number of reviews, but we did not find any significant influence of ratings on best-seller status.

Limitations

Since we used publicly available data from Amazon, the reduced horizontal dimensionality of the dataset limited the depth of exploration and insights that we could extract for this assignment.

We did not have access to the product description text for each product listing, and we performed neither semantic analysis on Product Title nor image analysis on Product Image. With these gaps combined, the recommendations could not include further details to guide the optimization of the product listing page, which could be a key factor in improving search result ranking and customer conversion. All of these could, downstream, increase the likelihood of a product becoming a Best Seller.

According to Amazon (https://sell.amazon.com/blog/amazon-best-sellers-rank), a machine-learning algorithm uses sales volume over time to generate the Best Seller ranking dynamically. Since this dataset only contains the single month of sales volume that was shown publicly, we could not validate whether the sales numbers were reliable. Our classification and regression models therefore had to use other publicly available variables as proxy predictors for Best Seller status.

References

[1] Reyes-Gómez, Juan, López Belbeze, Pilar & Rialp, Josep. (2024). The relationship between strategic orientations and firm performance and the role of innovation: a meta-analytic assessment of theoretical models. International Journal of Entrepreneurial Behavior & Research. https://doi.org/10.1108/IJEBR-02-2022-0200.

[2] International Trade Administration. (2023). Accessed via https://www.trade.gov/country-commercial-guides/canada-ecommerce.

[3] Top 10 Users of E-Commerce: https://www.ecommerce-nation.com/top-10-countries-with-the-largest-e-commerce-industry/#:~:text=when%20shopping%20online.-,1.,ahead%20of%20any%20other%20country.

[4] Kaggle data in Amazon e-commerce: https://www.kaggle.com/datasets/asaniczka/amazon-canada-products-2023-2-1m-products/data. Credit to user “Asaniczka”.

[5] Laura J. Miller. (2000). The Best-Seller List as Marketing Tool and Historical Fiction. Link: https://muse.jhu.edu/article/3606.

[6] Farzad Fathi. (2023). How Does Best Seller Recommendation Shape the Ecosystem of an Online Marketplace?. Link: https://questromworld.bu.edu/platformstrategy/wp-content/uploads/sites/49/2023/06/PlatStrat2023_paper_30.pdf

[7] Liu, Qiong & Wu, Ying. (2012). Supervised Learning. 10.1007/978-1-4419-1428-6_451.

[8] John, G.H. (1995). Robust Decision Trees: Removing Outliers from Databases. Knowledge Discovery and Data Mining.

[9] Cieslak, D.A., Hoens, T.R., Chawla, N.V. et al. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Disc 24, 136–158 (2012). https://doi.org/10.1007/s10618-011-0222-1

[10] C50 Classification: https://cran.r-project.org/web/packages/C50/vignettes/C5.0.html

[11] Random Forest Fundamental: https://link.springer.com/article/10.1023/A:1010933404324 (Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324)

[12] Amazon Best Seller Definition: https://www.amazon.com/gp/help/customer/display.html?nodeId=GGGMZK378RQPATDJ

[13] Kranzlein, Michael. (2018). A Multiple Classifier System for Predicting Best-Selling Amazon Products.

[14] Sperandei, S. (2014). Understanding Logistic Regression Analysis.